=============================
User Guide for AMDGPU Backend
=============================

.. contents::
   :local:

.. toctree::
   :hidden:

   AMDGPU/AMDGPUAsmGFX7
   AMDGPU/AMDGPUAsmGFX8
   AMDGPU/AMDGPUAsmGFX9
   AMDGPU/AMDGPUAsmGFX900
   AMDGPU/AMDGPUAsmGFX904
   AMDGPU/AMDGPUAsmGFX906
   AMDGPU/AMDGPUAsmGFX908
   AMDGPU/AMDGPUAsmGFX90a
   AMDGPU/AMDGPUAsmGFX940
   AMDGPU/AMDGPUAsmGFX10
   AMDGPU/AMDGPUAsmGFX1011
   AMDGPU/AMDGPUAsmGFX1013
   AMDGPU/AMDGPUAsmGFX1030
   AMDGPUModifierSyntax
   AMDGPUOperandSyntax
   AMDGPUInstructionSyntax
   AMDGPUInstructionNotation
   AMDGPUDwarfExtensionsForHeterogeneousDebugging
   AMDGPUDwarfExtensionAllowLocationDescriptionOnTheDwarfExpressionStack/AMDGPUDwarfExtensionAllowLocationDescriptionOnTheDwarfExpressionStack

Introduction
============

The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
R600 family and continuing through the current GCN families. It lives in the
``llvm/lib/Target/AMDGPU`` directory.

LLVM
====

.. _amdgpu-target-triples:

Target Triples
--------------

Use the Clang option ``-target <Architecture>-<Vendor>-<OS>-<Environment>``
to specify the target triple:

  .. table:: AMDGPU Architectures
     :name: amdgpu-architecture-table

     ============ ==============================================================
     Architecture Description
     ============ ==============================================================
     ``r600``     AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
     ``amdgcn``   AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
     ============ ==============================================================

  .. table:: AMDGPU Vendors
     :name: amdgpu-vendor-table

     ============ ==============================================================
     Vendor       Description
     ============ ==============================================================
     ``amd``      Can be used for all AMD GPU usage.
     ``mesa3d``   Can be used if the OS is ``mesa3d``.
     ============ ==============================================================

  .. table:: AMDGPU Operating Systems
     :name: amdgpu-os

     ============== ============================================================
     OS             Description
     ============== ============================================================
     *<empty>*      Defaults to the *unknown* OS.
     ``amdhsa``     Compute kernels executed on HSA [HSA]_ compatible runtimes
                    such as:

                    - AMD's ROCm™ runtime [AMD-ROCm]_ using the *rocm-amdhsa*
                      loader on Linux. See *AMD ROCm Platform Release Notes*
                      [AMD-ROCm-Release-Notes]_ for supported hardware and
                      software.
                    - AMD's PAL runtime using the *pal-amdhsa* loader on
                      Windows.

     ``amdpal``     Graphics shaders and compute kernels executed on AMD's PAL
                    runtime using the *pal-amdpal* loader on Windows and Linux
                    Pro.
     ``mesa3d``     Graphics shaders and compute kernels executed on AMD's Mesa
                    3D runtime using the *mesa-mesa3d* loader on Linux.
     ============== ============================================================

  .. table:: AMDGPU Environments
     :name: amdgpu-environment-table

     ============ ==============================================================
     Environment  Description
     ============ ==============================================================
     *<empty>*    Default.
     ============ ==============================================================

.. _amdgpu-processors:

Processors
----------

Use the Clang options ``-mcpu=<target-id>`` or ``--offload-arch=<target-id>`` to
specify the AMDGPU processor together with optional target features. See
:ref:`amdgpu-target-id` and :ref:`amdgpu-target-features` for AMD GPU target
specific information.
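
For illustration, the shape of a target ID such as ``gfx908:xnack+`` can be
modelled with a small Python sketch. The ``parse_target_id`` helper below is
hypothetical (not part of Clang or LLVM); the authoritative grammar is given by
the `Clang Offload Bundler
<https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ documentation:

.. code-block:: python

   def parse_target_id(target_id):
       """Split '<processor>[:<target-feature>(+|-)]...' into components.

       Features not mentioned in the ID are left at their 'any' setting
       (see the Target Features section)."""
       processor, *suffixes = target_id.split(":")
       features = {}
       for suffix in suffixes:
           name, sign = suffix[:-1], suffix[-1]
           features[name] = "on" if sign == "+" else "off"
       return processor, features

   print(parse_target_id("gfx90a:sramecc+:xnack-"))
   # ('gfx90a', {'sramecc': 'on', 'xnack': 'off'})
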

Every processor supports every OS ABI (see :ref:`amdgpu-os`) with the following exceptions:

* ``amdhsa`` is not supported on the ``r600`` architecture (see :ref:`amdgpu-architecture-table`).


  .. table:: AMDGPU Processors
     :name: amdgpu-processor-table

     =========== =============== ============ ===== ================= =============== =============== ======================
     Processor   Alternative     Target       dGPU/ Target            Target          OS Support      Example
                 Processor       Triple       APU   Features          Properties      *(see*          Products
                                 Architecture       Supported                         `amdgpu-os`_
                                                                                      *and
                                                                                      corresponding
                                                                                      runtime release
                                                                                      notes for
                                                                                      current
                                                                                      information and
                                                                                      level of
                                                                                      support)*
     =========== =============== ============ ===== ================= =============== =============== ======================
     **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``r600``                    ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``r630``                    ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``rs880``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``rv670``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``rv710``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``rv730``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``rv770``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``cedar``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``cypress``                 ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``juniper``                 ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``redwood``                 ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``sumo``                    ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``barts``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``caicos``                  ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``cayman``                  ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``turks``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx600``  - ``tahiti``    ``amdgcn``   dGPU                    - Does not        - *pal-amdpal*
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``gfx601``  - ``pitcairn``  ``amdgcn``   dGPU                    - Does not        - *pal-amdpal*
                 - ``verde``                                            support
                                                                        generic
                                                                        address
                                                                        space
     ``gfx602``  - ``hainan``    ``amdgcn``   dGPU                    - Does not        - *pal-amdpal*
                 - ``oland``                                            support
                                                                        generic
                                                                        address
                                                                        space
     **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx700``  - ``kaveri``    ``amdgcn``   APU                     - Offset          - *rocm-amdhsa* - A6-7000
                                                                        flat            - *pal-amdhsa*  - A6 Pro-7050B
                                                                        scratch         - *pal-amdpal*  - A8-7100
                                                                                                        - A8 Pro-7150B
                                                                                                        - A10-7300
                                                                                                        - A10 Pro-7350B
                                                                                                        - FX-7500
                                                                                                        - A8-7200P
                                                                                                        - A10-7400P
                                                                                                        - FX-7600P
     ``gfx701``  - ``hawaii``    ``amdgcn``   dGPU                    - Offset          - *rocm-amdhsa* - FirePro W8100
                                                                        flat            - *pal-amdhsa*  - FirePro W9100
                                                                        scratch         - *pal-amdpal*  - FirePro S9150
                                                                                                        - FirePro S9170
     ``gfx702``                  ``amdgcn``   dGPU                    - Offset          - *rocm-amdhsa* - Radeon R9 290
                                                                        flat            - *pal-amdhsa*  - Radeon R9 290x
                                                                        scratch         - *pal-amdpal*  - Radeon R390
                                                                                                        - Radeon R390x
     ``gfx703``  - ``kabini``    ``amdgcn``   APU                     - Offset          - *pal-amdhsa*  - E1-2100
                 - ``mullins``                                          flat            - *pal-amdpal*  - E1-2200
                                                                        scratch                         - E1-2500
                                                                                                        - E2-3000
                                                                                                        - E2-3800
                                                                                                        - A4-5000
                                                                                                        - A4-5100
                                                                                                        - A6-5200
                                                                                                        - A4 Pro-3340B
     ``gfx704``  - ``bonaire``   ``amdgcn``   dGPU                    - Offset          - *pal-amdhsa*  - Radeon HD 7790
                                                                        flat            - *pal-amdpal*  - Radeon HD 8770
                                                                        scratch                         - R7 260
                                                                                                        - R7 260X
     ``gfx705``                  ``amdgcn``   APU                     - Offset          - *pal-amdhsa*  *TBA*
                                                                        flat            - *pal-amdpal*
                                                                        scratch                         .. TODO::

                                                                                                           Add product
                                                                                                           names.

     **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx801``  - ``carrizo``   ``amdgcn``   APU   - xnack           - Offset          - *rocm-amdhsa* - A6-8500P
                                                                        flat            - *pal-amdhsa*  - Pro A6-8500B
                                                                        scratch         - *pal-amdpal*  - A8-8600P
                                                                                                        - Pro A8-8600B
                                                                                                        - FX-8800P
                                                                                                        - Pro A12-8800B
                                                                                                        - A10-8700P
                                                                                                        - Pro A10-8700B
                                                                                                        - A10-8780P
                                                                                                        - A10-9600P
                                                                                                        - A10-9630P
                                                                                                        - A12-9700P
                                                                                                        - A12-9730P
                                                                                                        - FX-9800P
                                                                                                        - FX-9830P
                                                                                                        - E2-9010
                                                                                                        - A6-9210
                                                                                                        - A9-9410
     ``gfx802``  - ``iceland``   ``amdgcn``   dGPU                    - Offset          - *rocm-amdhsa* - Radeon R9 285
                 - ``tonga``                                            flat            - *pal-amdhsa*  - Radeon R9 380
                                                                        scratch         - *pal-amdpal*  - Radeon R9 385
     ``gfx803``  - ``fiji``      ``amdgcn``   dGPU                    - Offset          - *rocm-amdhsa* - Radeon R9 Nano
                                                                        flat            - *pal-amdhsa*  - Radeon R9 Fury
                                                                        scratch         - *pal-amdpal*  - Radeon R9 FuryX
                                                                                                        - Radeon Pro Duo
                                                                                                        - FirePro S9300x2
                                                                                                        - Radeon Instinct MI8
     \           - ``polaris10`` ``amdgcn``   dGPU                    - Offset          - *rocm-amdhsa* - Radeon RX 470
                                                                        flat            - *pal-amdhsa*  - Radeon RX 480
                                                                        scratch         - *pal-amdpal*  - Radeon Instinct MI6
     \           - ``polaris11`` ``amdgcn``   dGPU                    - Offset          - *rocm-amdhsa* - Radeon RX 460
                                                                        flat            - *pal-amdhsa*
                                                                        scratch         - *pal-amdpal*
     ``gfx805``  - ``tongapro``  ``amdgcn``   dGPU                    - Offset          - *rocm-amdhsa* - FirePro S7150
                                                                        flat            - *pal-amdhsa*  - FirePro S7100
                                                                        scratch         - *pal-amdpal*  - FirePro W7100
                                                                                                        - Mobile FirePro
                                                                                                          M7170
     ``gfx810``  - ``stoney``    ``amdgcn``   APU   - xnack           - Offset          - *rocm-amdhsa* *TBA*
                                                                        flat            - *pal-amdhsa*
                                                                        scratch         - *pal-amdpal*  .. TODO::

                                                                                                           Add product
                                                                                                           names.

     **GCN GFX9 (Vega)** [AMD-GCN-GFX900-GFX904-VEGA]_ [AMD-GCN-GFX906-VEGA7NM]_ [AMD-GCN-GFX908-CDNA1]_ [AMD-GCN-GFX90A-CDNA2]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx900``                  ``amdgcn``   dGPU  - xnack           - Absolute        - *rocm-amdhsa* - Radeon Vega
                                                                        flat            - *pal-amdhsa*    Frontier Edition
                                                                        scratch         - *pal-amdpal*  - Radeon RX Vega 56
                                                                                                        - Radeon RX Vega 64
                                                                                                        - Radeon RX Vega 64
                                                                                                          Liquid
                                                                                                        - Radeon Instinct MI25
     ``gfx902``                  ``amdgcn``   APU   - xnack           - Absolute        - *rocm-amdhsa* - Ryzen 3 2200G
                                                                        flat            - *pal-amdhsa*  - Ryzen 5 2400G
                                                                        scratch         - *pal-amdpal*
     ``gfx904``                  ``amdgcn``   dGPU  - xnack                             - *rocm-amdhsa* *TBA*
                                                                                        - *pal-amdhsa*
                                                                                        - *pal-amdpal*  .. TODO::

                                                                                                           Add product
                                                                                                           names.

     ``gfx906``                  ``amdgcn``   dGPU  - sramecc         - Absolute        - *rocm-amdhsa* - Radeon Instinct MI50
                                                    - xnack             flat            - *pal-amdhsa*  - Radeon Instinct MI60
                                                                        scratch         - *pal-amdpal*  - Radeon VII
                                                                                                        - Radeon Pro VII
     ``gfx908``                  ``amdgcn``   dGPU  - sramecc                           - *rocm-amdhsa* - AMD Instinct MI100 Accelerator
                                                    - xnack           - Absolute
                                                                        flat
                                                                        scratch
     ``gfx909``                  ``amdgcn``   APU   - xnack           - Absolute        - *pal-amdpal*  *TBA*
                                                                        flat
                                                                        scratch                         .. TODO::

                                                                                                           Add product
                                                                                                           names.

     ``gfx90a``                  ``amdgcn``   dGPU  - sramecc         - Absolute        - *rocm-amdhsa* *TBA*
                                                    - tgsplit           flat
                                                    - xnack             scratch                         .. TODO::
                                                    - Packed
                                                      work-item                                            Add product
                                                      IDs                                                  names.

     ``gfx90c``                  ``amdgcn``   APU   - xnack           - Absolute        - *pal-amdpal*  - Ryzen 7 4700G
                                                                        flat                            - Ryzen 7 4700GE
                                                                        scratch                         - Ryzen 5 4600G
                                                                                                        - Ryzen 5 4600GE
                                                                                                        - Ryzen 3 4300G
                                                                                                        - Ryzen 3 4300GE
                                                                                                        - Ryzen Pro 4000G
                                                                                                        - Ryzen 7 Pro 4700G
                                                                                                        - Ryzen 7 Pro 4750GE
                                                                                                        - Ryzen 5 Pro 4650G
                                                                                                        - Ryzen 5 Pro 4650GE
                                                                                                        - Ryzen 3 Pro 4350G
                                                                                                        - Ryzen 3 Pro 4350GE

     ``gfx940``                  ``amdgcn``   dGPU  - sramecc         - Architected                     *TBA*
                                                    - tgsplit           flat
                                                    - xnack             scratch                         .. TODO::
                                                    - Packed
                                                      work-item                                            Add product
                                                      IDs                                                  names.

     **GCN GFX10.1 (RDNA 1)** [AMD-GCN-GFX10-RDNA1]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx1010``                 ``amdgcn``   dGPU  - cumode          - Absolute        - *rocm-amdhsa* - Radeon RX 5700
                                                    - wavefrontsize64   flat            - *pal-amdhsa*  - Radeon RX 5700 XT
                                                    - xnack             scratch         - *pal-amdpal*  - Radeon Pro 5600 XT
                                                                                                        - Radeon Pro 5600M
     ``gfx1011``                 ``amdgcn``   dGPU  - cumode                            - *rocm-amdhsa* - Radeon Pro V520
                                                    - wavefrontsize64 - Absolute        - *pal-amdhsa*
                                                    - xnack             flat            - *pal-amdpal*
                                                                        scratch
     ``gfx1012``                 ``amdgcn``   dGPU  - cumode          - Absolute        - *rocm-amdhsa* - Radeon RX 5500
                                                    - wavefrontsize64   flat            - *pal-amdhsa*  - Radeon RX 5500 XT
                                                    - xnack             scratch         - *pal-amdpal*
     ``gfx1013``                 ``amdgcn``   APU   - cumode          - Absolute        - *rocm-amdhsa* *TBA*
                                                    - wavefrontsize64   flat            - *pal-amdhsa*
                                                    - xnack             scratch         - *pal-amdpal*  .. TODO::

                                                                                                           Add product
                                                                                                           names.

     **GCN GFX10.3 (RDNA 2)** [AMD-GCN-GFX10-RDNA2]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx1030``                 ``amdgcn``   dGPU  - cumode          - Absolute        - *rocm-amdhsa* - Radeon RX 6800
                                                    - wavefrontsize64   flat            - *pal-amdhsa*  - Radeon RX 6800 XT
                                                                        scratch         - *pal-amdpal*  - Radeon RX 6900 XT
     ``gfx1031``                 ``amdgcn``   dGPU  - cumode          - Absolute        - *rocm-amdhsa* - Radeon RX 6700 XT
                                                    - wavefrontsize64   flat            - *pal-amdhsa*
                                                                        scratch         - *pal-amdpal*
     ``gfx1032``                 ``amdgcn``   dGPU  - cumode          - Absolute        - *rocm-amdhsa* *TBA*
                                                    - wavefrontsize64   flat            - *pal-amdhsa*
                                                                        scratch         - *pal-amdpal*  .. TODO::

                                                                                                           Add product
                                                                                                           names.

     ``gfx1033``                 ``amdgcn``   APU   - cumode          - Absolute        - *pal-amdpal*  *TBA*
                                                    - wavefrontsize64   flat
                                                                        scratch                         .. TODO::

                                                                                                           Add product
                                                                                                           names.
     ``gfx1034``                 ``amdgcn``   dGPU  - cumode          - Absolute        - *pal-amdpal*  *TBA*
                                                    - wavefrontsize64   flat
                                                                        scratch                         .. TODO::

                                                                                                           Add product
                                                                                                           names.

     ``gfx1035``                 ``amdgcn``   APU   - cumode          - Absolute        - *pal-amdpal*  *TBA*
                                                    - wavefrontsize64   flat
                                                                        scratch                         .. TODO::

                                                                                                           Add product
                                                                                                           names.

     ``gfx1036``                 ``amdgcn``   APU   - cumode          - Absolute        - *pal-amdpal*  *TBA*
                                                    - wavefrontsize64   flat
                                                                        scratch                         .. TODO::

                                                                                                           Add product
                                                                                                           names.

     **GCN GFX11**
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx1100``                 ``amdgcn``   dGPU  - cumode          - Architected     - *pal-amdpal*  *TBA*
                                                    - wavefrontsize64   flat
                                                                        scratch                         .. TODO::
                                                    - Packed
                                                      work-item                                            Add product
                                                      IDs                                                  names.

     ``gfx1101``                 ``amdgcn``   dGPU  - cumode          - Architected                     *TBA*
                                                    - wavefrontsize64   flat
                                                                        scratch                         .. TODO::
                                                    - Packed
                                                      work-item                                            Add product
                                                      IDs                                                  names.

     ``gfx1102``                 ``amdgcn``   dGPU  - cumode          - Architected                     *TBA*
                                                    - wavefrontsize64   flat
                                                                        scratch                         .. TODO::
                                                    - Packed
                                                      work-item                                            Add product
                                                      IDs                                                  names.

     ``gfx1103``                 ``amdgcn``   APU   - cumode          - Architected                     *TBA*
                                                    - wavefrontsize64   flat
                                                                        scratch                         .. TODO::
                                                    - Packed
                                                      work-item                                            Add product
                                                      IDs                                                  names.

     =========== =============== ============ ===== ================= =============== =============== ======================

.. _amdgpu-target-features:

Target Features
---------------

Target features control how code is generated to support certain
processor specific features. Not all target features are supported by
all processors. The runtime must ensure that the features supported by
the device used to execute the code match the features enabled when
generating the code. A mismatch of features may result in incorrect
execution, or a reduction in performance.

The target features supported by each processor are listed in
:ref:`amdgpu-processor-table`.

Target features are controlled by exactly one of the following Clang
options:

``-mcpu=<target-id>`` or ``--offload-arch=<target-id>``

  The ``-mcpu`` and ``--offload-arch`` options can specify the target feature as
  optional components of the target ID. If omitted, the target feature has the
  ``any`` value. See :ref:`amdgpu-target-id`.

``-m[no-]<target-feature>``

  Target features not specified by the target ID are specified using a
  separate option. These target features can have an ``on`` or ``off``
  value. ``on`` is specified by omitting the ``no-`` prefix, and
  ``off`` is specified by including the ``no-`` prefix. The default
  if not specified is ``off``.

For example:

``-mcpu=gfx908:xnack+``
  Enable the ``xnack`` feature.
``-mcpu=gfx908:xnack-``
  Disable the ``xnack`` feature.
``-mcumode``
  Enable the ``cumode`` feature.
``-mno-cumode``
  Disable the ``cumode`` feature.

  .. table:: AMDGPU Target Features
     :name: amdgpu-target-features-table

     =============== ============================ ==================================================
     Target Feature  Clang Option to Control      Description
     Name
     =============== ============================ ==================================================
     cumode          - ``-m[no-]cumode``          Control the wavefront execution mode used
                                                  when generating code for kernels. When disabled
                                                  native WGP wavefront execution mode is used,
                                                  when enabled CU wavefront execution mode is used
                                                  (see :ref:`amdgpu-amdhsa-memory-model`).

     sramecc         - ``-mcpu``                  If specified, generate code that can only be
                     - ``--offload-arch``         loaded and executed in a process that has a
                                                  matching setting for SRAMECC.

                                                  If not specified for code object V2 to V3, generate
                                                  code that can be loaded and executed in a process
                                                  with SRAMECC enabled.

                                                  If not specified for code object V4 or above, generate
                                                  code that can be loaded and executed in a process
                                                  with either setting of SRAMECC.

     tgsplit         ``-m[no-]tgsplit``           Enable/disable generating code that assumes
                                                  work-groups are launched in threadgroup split mode.
                                                  When enabled the waves of a work-group may be
                                                  launched in different CUs.

     wavefrontsize64 - ``-m[no-]wavefrontsize64`` Control the wavefront size used when
                                                  generating code for kernels. When disabled
                                                  native wavefront size 32 is used, when enabled
                                                  wavefront size 64 is used.

     xnack           - ``-mcpu``                  If specified, generate code that can only be
                     - ``--offload-arch``         loaded and executed in a process that has a
                                                  matching setting for XNACK replay.

                                                  If not specified for code object V2 to V3, generate
                                                  code that can be loaded and executed in a process
                                                  with XNACK replay enabled.

                                                  If not specified for code object V4 or above, generate
                                                  code that can be loaded and executed in a process
                                                  with either setting of XNACK replay.

                                                  XNACK replay can be used for demand paging and
                                                  page migration. If enabled in the device, then if
                                                  a page fault occurs the code may execute
                                                  incorrectly unless generated with XNACK replay
                                                  enabled, or generated for code object V4 or above
                                                  without specifying XNACK replay. Executing code
                                                  that was generated with XNACK replay enabled, or
                                                  generated for code object V4 or above without
                                                  specifying XNACK replay, on a device that does
                                                  not have XNACK replay enabled will execute
                                                  correctly but may be less performant than code
                                                  generated for XNACK replay disabled.
     =============== ============================ ==================================================

.. _amdgpu-target-id:

Target ID
---------

AMDGPU supports target IDs. See `Clang Offload Bundler
<https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ for a general
description. The AMDGPU target specific information is:

**processor**
  Is an AMDGPU processor or alternative processor name specified in
  :ref:`amdgpu-processor-table`. The non-canonical form target ID allows both
  the primary processor and alternative processor names. The canonical form
  target ID only allows the primary processor name.

**target-feature**
  Is a target feature name specified in :ref:`amdgpu-target-features-table` that
  is supported by the processor. The target features supported by each processor
  are specified in :ref:`amdgpu-processor-table`. Those that can be specified in
  a target ID are marked as being controlled by ``-mcpu`` and
  ``--offload-arch``. Each target feature must appear at most once in a target
  ID. The non-canonical form target ID allows the target features to be
  specified in any order. The canonical form target ID requires the target
  features to be specified in alphabetic order.

.. _amdgpu-target-id-v2-v3:

Code Object V2 to V3 Target ID
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The target ID syntax for code object V2 to V3 is the same as defined in `Clang
Offload Bundler <https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ except
when used in the :ref:`amdgpu-assembler-directive-amdgcn-target` assembler
directive and the bundle entry ID. In those cases it has the following BNF
syntax:

.. code::

  <target-id> ::== <processor> ( "+" <target-feature> )*

Where a target feature is omitted if *Off* and present if *On* or *Any*.

.. note::

  Code object V2 to V3 cannot represent *Any* and treats it the same as
  *On*.

.. _amdgpu-embedding-bundled-objects:

Embedding Bundled Code Objects
------------------------------

AMDGPU supports the HIP and OpenMP languages that perform code object embedding
as described in `Clang Offload Bundler
<https://clang.llvm.org/docs/ClangOffloadBundler.html>`_.

.. note::

  The target ID syntax used for code object V2 to V3 for a bundle entry ID
  differs from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.

.. _amdgpu-address-spaces:

Address Spaces
--------------

The AMDGPU architecture supports a number of memory address spaces. The address
space names use the OpenCL standard names, with some additions.

The AMDGPU address spaces correspond to target architecture specific LLVM
address space numbers used in LLVM IR.

The AMDGPU address spaces are described in
:ref:`amdgpu-address-spaces-table`. Only 64-bit process address spaces are
supported for the ``amdgcn`` target.

  .. table:: AMDGPU Address Spaces
     :name: amdgpu-address-spaces-table

     ================================= =============== =========== ================ ======= ============================
     ..                                                                             64-Bit Process Address Space
     --------------------------------- --------------- ----------- ---------------- ------------------------------------
     Address Space Name                LLVM IR Address HSA Segment Hardware         Address NULL Value
                                       Space Number    Name        Name             Size
     ================================= =============== =========== ================ ======= ============================
     Generic                           0               flat        flat             64      0x0000000000000000
     Global                            1               global      global           64      0x0000000000000000
     Region                            2               N/A         GDS              32      *not implemented for AMDHSA*
     Local                             3               group       LDS              32      0xFFFFFFFF
     Constant                          4               constant    *same as global* 64      0x0000000000000000
     Private                           5               private     scratch          32      0xFFFFFFFF
     Constant 32-bit                   6               *TODO*                               0x00000000
     Buffer Fat Pointer (experimental) 7               *TODO*
     ================================= =============== =========== ================ ======= ============================

**Generic**
  The generic address space is supported unless the *Target Properties* column
  of :ref:`amdgpu-processor-table` specifies *Does not support generic address
  space*.

  The generic address space uses the hardware flat address support for two fixed
  ranges of virtual addresses (the private and local apertures), that are
  outside the range of addressable global memory, to map from a flat address to
  a private or local address. This uses FLAT instructions that can take a flat
  address and access global, private (scratch), and group (LDS) memory depending
  on whether the address is within one of the aperture ranges.

  Flat access to scratch requires hardware aperture setup and setup in the
  kernel prologue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat
  access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register
  setup (see :ref:`amdgpu-amdhsa-kernel-prolog-m0`).
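
  Because an aperture in 64-bit address mode is a fixed 2^32-byte range with a
  2^32-aligned base, the flat/segment conversion amounts to adding or
  subtracting the aperture base. The following Python sketch is an illustrative
  model only; the helper names and the example base value are ours, standing in
  for the values the hardware exposes:

  .. code-block:: python

     # Illustrative model: 'aperture_base' stands in for the value read from
     # SRC_SHARED_BASE/SRC_PRIVATE_BASE (GFX9 onwards) or from the HSA AQL
     # queue (GFX7-GFX8).  Assumes 64-bit mode: 2^32-byte apertures with
     # 2^32-aligned bases.
     APERTURE_SIZE = 1 << 32

     def segment_to_flat(aperture_base, segment_address):
         # With an aligned base, the 32-bit segment offset is simply added.
         return aperture_base + segment_address

     def flat_to_segment(aperture_base, flat_address):
         # Subtracting the base (mod 2^32) recovers the segment offset.
         return (flat_address - aperture_base) % APERTURE_SIZE

     base = 0x0000_7F00_0000_0000      # hypothetical private aperture base
     flat = segment_to_flat(base, 0x1234)
     print(hex(flat), hex(flat_to_segment(base, flat)))
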

  To convert between a private or group address space address (termed a segment
  address) and a flat address the base address of the corresponding aperture
  can be used. For GFX7-GFX8 these are available in the
  :ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
  Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
  GFX9-GFX11 the aperture base addresses are directly available as inline
  constant registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``.
  In 64-bit address mode the aperture sizes are 2^32 bytes and the base is
  aligned to 2^32 which makes it easier to convert from flat to segment or
  segment to flat.

  A global address space address has the same value when used as a flat address
  so no conversion is needed.

**Global and Constant**
  The global and constant address spaces both use global virtual addresses,
  which are the same virtual address space used by the CPU. However, some
  virtual addresses may only be accessible to the CPU, some only accessible
  by the GPU, and some by both.

  Using the constant address space indicates that the data will not change
  during the execution of the kernel. This allows scalar read instructions to
  be used. As the constant address space can only be modified on the host
  side, a generic pointer loaded from the constant address space is safe to be
  assumed to be a global pointer since only the device global memory is visible
  and managed on the host side. The vector and scalar L1 caches are invalidated
  of volatile data before each kernel dispatch execution to allow constant
  memory to change values between kernel dispatches.

**Region**
  The region address space uses the hardware Global Data Store (GDS). All
  wavefronts executing on the same device will access the same memory for any
  given region address. However, the same region address accessed by wavefronts
  executing on different devices will access different memory. It is higher
  performance than global memory. It is allocated by the runtime. The data
  store (DS) instructions can be used to access it.

**Local**
  The local address space uses the hardware Local Data Store (LDS) which is
  automatically allocated when the hardware creates the wavefronts of a
  work-group, and freed when all the wavefronts of a work-group have
  terminated. All wavefronts belonging to the same work-group will access the
  same memory for any given local address. However, the same local address
  accessed by wavefronts belonging to different work-groups will access
  different memory. It is higher performance than global memory. The data
  store (DS) instructions can be used to access it.

**Private**
  The private address space uses the hardware scratch memory support which
  automatically allocates memory when it creates a wavefront and frees it when
  a wavefront terminates. The memory accessed by a lane of a wavefront for any
  given private address will be different from the memory accessed by another
  lane of the same or different wavefront for the same private address.

  If a kernel dispatch uses scratch, then the hardware allocates memory from a
  pool of backing memory allocated by the runtime for each wavefront. The lanes
  of the wavefront access this using dword (4 byte) interleaving. The mapping
  used from private address to backing memory address is:

  ``wavefront-scratch-base +
  ((private-address / 4) * wavefront-size * 4) +
  (wavefront-lane-id * 4) + (private-address % 4)``

  If each lane of a wavefront accesses the same private address, the
  interleaving results in adjacent dwords being accessed and hence requires
  fewer cache lines to be fetched.
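
  The mapping can be transcribed directly into a short Python sketch (the
  function name and example values below are illustrative; ``wavefront_size``
  would be 32 or 64 on real hardware):

  .. code-block:: python

     def backing_address(wavefront_scratch_base, private_address,
                         wavefront_size, lane_id):
         # Direct transcription of the private-address mapping above.
         return (wavefront_scratch_base
                 + (private_address // 4) * wavefront_size * 4
                 + lane_id * 4
                 + private_address % 4)

     # Four lanes of a wavefront-size-64 wavefront reading private address 8:
     # the accesses land on consecutive dwords, so they need the fewest
     # possible cache lines.
     print([backing_address(0x1000, 8, 64, lane) for lane in range(4)])
     # [4608, 4612, 4616, 4620]
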

  There are different ways that the wavefront scratch base address is
  determined by a wavefront (see
  :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

  Scratch memory can be accessed in an interleaved manner using buffer
  instructions with the scratch buffer descriptor and per wavefront scratch
  offset, by the scratch instructions, or by flat instructions. Multi-dword
  access is not supported except by flat and scratch instructions in
  GFX9-GFX11.

**Constant 32-bit**
  *TODO*

**Buffer Fat Pointer**
  The buffer fat pointer is an experimental address space that is currently
  unsupported in the backend. It exposes a non-integral pointer that is in
  the future intended to support the modelling of 128-bit buffer descriptors
  plus a 32-bit offset into the buffer (in total encapsulating a 160-bit
  *pointer*), allowing normal LLVM load/store/atomic operations to be used to
  model the buffer descriptors used heavily in graphics workloads targeting
  the backend.

.. _amdgpu-memory-scopes:

Memory Scopes
-------------

This section provides LLVM memory synchronization scopes supported by the AMDGPU
backend memory model when the target triple OS is ``amdhsa`` (see
:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).

The memory model supported is based on the HSA memory model [HSA]_ which is
based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
relation is transitive over the synchronizes-with relation independent of scope
and synchronizes-with allows the memory scope instances to be inclusive (see
table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).

This is different from the OpenCL [OpenCL]_ memory model which does not have
scope inclusion and requires the memory scopes to exactly match. However, this
is conservatively correct for OpenCL.

  .. table:: AMDHSA LLVM Sync Scopes
     :name: amdgpu-amdhsa-llvm-sync-scopes-table

     ======================= ===================================================
     LLVM Sync Scope         Description
     ======================= ===================================================
     *none*                  The default: ``system``.

                             Synchronizes with, and participates in modification
                             and seq_cst total orderings with, other operations
                             (except image operations) for all address spaces
                             (except private, or generic that accesses private)
                             provided the other operation's sync scope is:

                             - ``system``.
                             - ``agent`` and executed by a thread on the same
                               agent.
                             - ``workgroup`` and executed by a thread in the
                               same work-group.
                             - ``wavefront`` and executed by a thread in the
                               same wavefront.

     ``agent``               Synchronizes with, and participates in modification
                             and seq_cst total orderings with, other operations
                             (except image operations) for all address spaces
                             (except private, or generic that accesses private)
                             provided the other operation's sync scope is:

                             - ``system`` or ``agent`` and executed by a thread
                               on the same agent.
                             - ``workgroup`` and executed by a thread in the
                               same work-group.
                             - ``wavefront`` and executed by a thread in the
                               same wavefront.

     ``workgroup``           Synchronizes with, and participates in modification
                             and seq_cst total orderings with, other operations
                             (except image operations) for all address spaces
                             (except private, or generic that accesses private)
                             provided the other operation's sync scope is:

                             - ``system``, ``agent`` or ``workgroup`` and
                               executed by a thread in the same work-group.
                             - ``wavefront`` and executed by a thread in the
                               same wavefront.
849 850 ``wavefront`` Synchronizes with, and participates in modification 851 and seq_cst total orderings with, other operations 852 (except image operations) for all address spaces 853 (except private, or generic that accesses private) 854 provided the other operation's sync scope is: 855 856 - ``system``, ``agent``, ``workgroup`` or 857 ``wavefront`` and executed by a thread in the 858 same wavefront. 859 860 ``singlethread`` Only synchronizes with and participates in 861 modification and seq_cst total orderings with, 862 other operations (except image operations) running 863 in the same thread for all address spaces (for 864 example, in signal handlers). 865 866 ``one-as`` Same as ``system`` but only synchronizes with other 867 operations within the same address space. 868 869 ``agent-one-as`` Same as ``agent`` but only synchronizes with other 870 operations within the same address space. 871 872 ``workgroup-one-as`` Same as ``workgroup`` but only synchronizes with 873 other operations within the same address space. 874 875 ``wavefront-one-as`` Same as ``wavefront`` but only synchronizes with 876 other operations within the same address space. 877 878 ``singlethread-one-as`` Same as ``singlethread`` but only synchronizes with 879 other operations within the same address space. 880 ======================= =================================================== 881 882LLVM IR Intrinsics 883------------------ 884 885The AMDGPU backend implements the following LLVM IR intrinsics. 886 887*This section is WIP.* 888 889.. TODO:: 890 891 List AMDGPU intrinsics. 892 893LLVM IR Attributes 894------------------ 895 896The AMDGPU backend supports the following LLVM IR attributes. 897 898 .. 
table:: AMDGPU LLVM IR Attributes 899 :name: amdgpu-llvm-ir-attributes-table 900 901 ======================================= ========================================================== 902 LLVM Attribute Description 903 ======================================= ========================================================== 904 "amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that 905 will be specified when the kernel is dispatched. Generated 906 by the ``amdgpu_flat_work_group_size`` CLANG attribute [CLANG-ATTR]_. 907 The implied default value is 1,1024. 908 909 "amdgpu-implicitarg-num-bytes"="n" Number of kernel argument bytes to add to the kernel 910 argument block size for the implicit arguments. This 911 varies by OS and language (for OpenCL see 912 :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`). 913 "amdgpu-num-sgpr"="n" Specifies the number of SGPRs to use. Generated by 914 the ``amdgpu_num_sgpr`` CLANG attribute [CLANG-ATTR]_. 915 "amdgpu-num-vgpr"="n" Specifies the number of VGPRs to use. Generated by the 916 ``amdgpu_num_vgpr`` CLANG attribute [CLANG-ATTR]_. 917 "amdgpu-waves-per-eu"="m,n" Specify the minimum and maximum number of waves per 918 execution unit. Generated by the ``amdgpu_waves_per_eu`` 919 CLANG attribute [CLANG-ATTR]_. This is an optimization hint, 920 and the backend may not be able to satisfy the request. If 921 the specified range is incompatible with the function's 922 "amdgpu-flat-work-group-size" value, the occupancy 923 bounds implied by the workgroup size take precedence. 924 925 "amdgpu-ieee" true/false. Specify whether the function expects the IEEE field of the 926 mode register to be set on entry. Overrides the default for 927 the calling convention. 928 "amdgpu-dx10-clamp" true/false. Specify whether the function expects the DX10_CLAMP field of 929 the mode register to be set on entry. Overrides the default 930 for the calling convention. 
931 932 "amdgpu-no-workitem-id-x" Indicates the function does not depend on the value of the 933 llvm.amdgcn.workitem.id.x intrinsic. If a function is marked with this 934 attribute, or reached through a call site marked with this attribute, 935 the value returned by the intrinsic is undefined. The backend can 936 generally infer this during code generation, so typically there is no 937 benefit to frontends marking functions with this. 938 939 "amdgpu-no-workitem-id-y" The same as amdgpu-no-workitem-id-x, except for the 940 llvm.amdgcn.workitem.id.y intrinsic. 941 942 "amdgpu-no-workitem-id-z" The same as amdgpu-no-workitem-id-x, except for the 943 llvm.amdgcn.workitem.id.z intrinsic. 944 945 "amdgpu-no-workgroup-id-x" The same as amdgpu-no-workitem-id-x, except for the 946 llvm.amdgcn.workgroup.id.x intrinsic. 947 948 "amdgpu-no-workgroup-id-y" The same as amdgpu-no-workitem-id-x, except for the 949 llvm.amdgcn.workgroup.id.y intrinsic. 950 951 "amdgpu-no-workgroup-id-z" The same as amdgpu-no-workitem-id-x, except for the 952 llvm.amdgcn.workgroup.id.z intrinsic. 953 954 "amdgpu-no-dispatch-ptr" The same as amdgpu-no-workitem-id-x, except for the 955 llvm.amdgcn.dispatch.ptr intrinsic. 956 957 "amdgpu-no-implicitarg-ptr" The same as amdgpu-no-workitem-id-x, except for the 958 llvm.amdgcn.implicitarg.ptr intrinsic. 959 960 "amdgpu-no-dispatch-id" The same as amdgpu-no-workitem-id-x, except for the 961 llvm.amdgcn.dispatch.id intrinsic. 962 963 "amdgpu-no-queue-ptr" Similar to amdgpu-no-workitem-id-x, except for the 964 llvm.amdgcn.queue.ptr intrinsic. Note that unlike the other ABI hint 965 attributes, the queue pointer may be required in situations where the 966 intrinsic call does not directly appear in the program. Some subtargets 967 require the queue pointer to handle some addrspacecasts, as well 968 as the llvm.amdgcn.is.shared, llvm.amdgcn.is.private, llvm.trap, and 969 llvm.debugtrap intrinsics. 
970 971 "amdgpu-no-hostcall-ptr" Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit 972 kernel argument that holds the pointer to the hostcall buffer. If this 973 attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed. 974 975 "amdgpu-no-heap-ptr" Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit 976 kernel argument that holds the pointer to an initialized memory buffer 977 that conforms to the requirements of the malloc/free device library V1 978 version implementation. If this attribute is absent, then the 979 amdgpu-no-implicitarg-ptr is also removed. 980 981 "amdgpu-no-multigrid-sync-arg" Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit 982 kernel argument that holds the multigrid synchronization pointer. If this 983 attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed. 984 ======================================= ========================================================== 985 986.. _amdgpu-elf-code-object: 987 988ELF Code Object 989=============== 990 991The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that 992can be linked by ``lld`` to produce a standard ELF shared code object which can 993be loaded and executed on an AMDGPU target. 994 995.. _amdgpu-elf-header: 996 997Header 998------ 999 1000The AMDGPU backend uses the following ELF header: 1001 1002 .. 
table:: AMDGPU ELF Header 1003 :name: amdgpu-elf-header-table 1004 1005 ========================== =============================== 1006 Field Value 1007 ========================== =============================== 1008 ``e_ident[EI_CLASS]`` ``ELFCLASS64`` 1009 ``e_ident[EI_DATA]`` ``ELFDATA2LSB`` 1010 ``e_ident[EI_OSABI]`` - ``ELFOSABI_NONE`` 1011 - ``ELFOSABI_AMDGPU_HSA`` 1012 - ``ELFOSABI_AMDGPU_PAL`` 1013 - ``ELFOSABI_AMDGPU_MESA3D`` 1014 ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA_V2`` 1015 - ``ELFABIVERSION_AMDGPU_HSA_V3`` 1016 - ``ELFABIVERSION_AMDGPU_HSA_V4`` 1017 - ``ELFABIVERSION_AMDGPU_HSA_V5`` 1018 - ``ELFABIVERSION_AMDGPU_PAL`` 1019 - ``ELFABIVERSION_AMDGPU_MESA3D`` 1020 ``e_type`` - ``ET_REL`` 1021 - ``ET_DYN`` 1022 ``e_machine`` ``EM_AMDGPU`` 1023 ``e_entry`` 0 1024 ``e_flags`` See :ref:`amdgpu-elf-header-e_flags-v2-table`, 1025 :ref:`amdgpu-elf-header-e_flags-table-v3`, 1026 and :ref:`amdgpu-elf-header-e_flags-table-v4-onwards` 1027 ========================== =============================== 1028 1029.. 1030 1031 .. table:: AMDGPU ELF Header Enumeration Values 1032 :name: amdgpu-elf-header-enumeration-values-table 1033 1034 =============================== ===== 1035 Name Value 1036 =============================== ===== 1037 ``EM_AMDGPU`` 224 1038 ``ELFOSABI_NONE`` 0 1039 ``ELFOSABI_AMDGPU_HSA`` 64 1040 ``ELFOSABI_AMDGPU_PAL`` 65 1041 ``ELFOSABI_AMDGPU_MESA3D`` 66 1042 ``ELFABIVERSION_AMDGPU_HSA_V2`` 0 1043 ``ELFABIVERSION_AMDGPU_HSA_V3`` 1 1044 ``ELFABIVERSION_AMDGPU_HSA_V4`` 2 1045 ``ELFABIVERSION_AMDGPU_HSA_V5`` 3 1046 ``ELFABIVERSION_AMDGPU_PAL`` 0 1047 ``ELFABIVERSION_AMDGPU_MESA3D`` 0 1048 =============================== ===== 1049 1050``e_ident[EI_CLASS]`` 1051 The ELF class is: 1052 1053 * ``ELFCLASS32`` for ``r600`` architecture. 1054 1055 * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64-bit 1056 process address space applications. 
1057 1058 ``e_ident[EI_DATA]`` 1059 All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering. 1060 1061 ``e_ident[EI_OSABI]`` 1062 One of the following AMDGPU target architecture specific OS ABIs 1063 (see :ref:`amdgpu-os`): 1064 1065 * ``ELFOSABI_NONE`` for *unknown* OS. 1066 1067 * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS. 1068 1069 * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS. 1070 1071 * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3d`` OS. 1072 1073 ``e_ident[EI_ABIVERSION]`` 1074 The ABI version of the AMDGPU target architecture specific OS ABI to which the code 1075 object conforms: 1076 1077 * ``ELFABIVERSION_AMDGPU_HSA_V2`` is used to specify the version of AMD HSA 1078 runtime ABI for code object V2. Specify using the Clang option 1079 ``-mcode-object-version=2``. 1080 1081 * ``ELFABIVERSION_AMDGPU_HSA_V3`` is used to specify the version of AMD HSA 1082 runtime ABI for code object V3. Specify using the Clang option 1083 ``-mcode-object-version=3``. 1084 1085 * ``ELFABIVERSION_AMDGPU_HSA_V4`` is used to specify the version of AMD HSA 1086 runtime ABI for code object V4. Specify using the Clang option 1087 ``-mcode-object-version=4``. This is the default code object 1088 version if not specified. 1089 1090 * ``ELFABIVERSION_AMDGPU_HSA_V5`` is used to specify the version of AMD HSA 1091 runtime ABI for code object V5. Specify using the Clang option 1092 ``-mcode-object-version=5``. 1093 1094 * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL 1095 runtime ABI. 1096 1097 * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA 1098 3D runtime ABI. 1099 1100 ``e_type`` 1101 Can be one of the following values: 1102 1103 1104 ``ET_REL`` 1105 The type produced by the AMDGPU backend compiler as it is a relocatable code 1106 object. 1107 1108 ``ET_DYN`` 1109 The type produced by the linker as it is a shared code object. 1110 1111 The AMD HSA runtime loader requires an ``ET_DYN`` code object. 
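As a sketch of how these identification fields can be checked, using the values
from the AMDGPU ELF header enumeration table above (the helper name and the
hard-coded ``amdhsa`` expectation are illustrative, not part of any AMDGPU
API):

.. code:: c

   #include <stdint.h>
   #include <string.h>

   /* Values from the AMDGPU ELF header enumeration table. */
   #define EM_AMDGPU           224
   #define ELFOSABI_AMDGPU_HSA 64
   #define ELFCLASS64          2
   #define ELFDATA2LSB         1

   /* Returns 1 if the buffer starts with an ELF header whose
      identification matches an amdgcn code object for the amdhsa OS.
      e_ident occupies bytes 0-15; e_machine is the little-endian 16-bit
      field at offset 18, after the 16-bit e_type field. */
   static int is_amdhsa_code_object(const uint8_t *hdr) {
     if (memcmp(hdr, "\x7f" "ELF", 4) != 0)
       return 0;                                /* ELF magic */
     if (hdr[4] != ELFCLASS64)
       return 0;                                /* e_ident[EI_CLASS] */
     if (hdr[5] != ELFDATA2LSB)
       return 0;                                /* e_ident[EI_DATA] */
     if (hdr[7] != ELFOSABI_AMDGPU_HSA)
       return 0;                                /* e_ident[EI_OSABI] */
     uint16_t e_machine = (uint16_t)(hdr[18] | ((uint16_t)hdr[19] << 8));
     return e_machine == EM_AMDGPU;             /* e_machine */
   }

The same check applies for the ``amdpal`` or ``mesa3d`` OS ABIs with
``ELFOSABI_AMDGPU_PAL`` (65) or ``ELFOSABI_AMDGPU_MESA3D`` (66) substituted.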
1112 1113``e_machine`` 1114 The value ``EM_AMDGPU`` is used for the machine for all processors supported 1115 by the ``r600`` and ``amdgcn`` architectures (see 1116 :ref:`amdgpu-processor-table`). The specific processor is specified in the 1117 ``NT_AMD_HSA_ISA_VERSION`` note record for code object V2 (see 1118 :ref:`amdgpu-note-records-v2`) and in the ``EF_AMDGPU_MACH`` bit field of the 1119 ``e_flags`` for code object V3 and above (see 1120 :ref:`amdgpu-elf-header-e_flags-table-v3` and 1121 :ref:`amdgpu-elf-header-e_flags-table-v4-onwards`). 1122 1123``e_entry`` 1124 The entry point is 0 as the entry points for individual kernels must be 1125 selected in order to invoke them through AQL packets. 1126 1127``e_flags`` 1128 The AMDGPU backend uses the following ELF header flags: 1129 1130 .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V2 1131 :name: amdgpu-elf-header-e_flags-v2-table 1132 1133 ===================================== ===== ============================= 1134 Name Value Description 1135 ===================================== ===== ============================= 1136 ``EF_AMDGPU_FEATURE_XNACK_V2`` 0x01 Indicates if the ``xnack`` 1137 target feature is 1138 enabled for all code 1139 contained in the code object. 1140 If the processor 1141 does not support the 1142 ``xnack`` target 1143 feature then must 1144 be 0. 1145 See 1146 :ref:`amdgpu-target-features`. 1147 ``EF_AMDGPU_FEATURE_TRAP_HANDLER_V2`` 0x02 Indicates if the trap 1148 handler is enabled for all 1149 code contained in the code 1150 object. If the processor 1151 does not support a trap 1152 handler then must be 0. 1153 See 1154 :ref:`amdgpu-target-features`. 1155 ===================================== ===== ============================= 1156 1157 .. 
table:: AMDGPU ELF Header ``e_flags`` for Code Object V3 1158 :name: amdgpu-elf-header-e_flags-table-v3 1159 1160 ================================= ===== ============================= 1161 Name Value Description 1162 ================================= ===== ============================= 1163 ``EF_AMDGPU_MACH`` 0x0ff AMDGPU processor selection 1164 mask for 1165 ``EF_AMDGPU_MACH_xxx`` values 1166 defined in 1167 :ref:`amdgpu-ef-amdgpu-mach-table`. 1168 ``EF_AMDGPU_FEATURE_XNACK_V3`` 0x100 Indicates if the ``xnack`` 1169 target feature is 1170 enabled for all code 1171 contained in the code object. 1172 If the processor 1173 does not support the 1174 ``xnack`` target 1175 feature then must 1176 be 0. 1177 See 1178 :ref:`amdgpu-target-features`. 1179 ``EF_AMDGPU_FEATURE_SRAMECC_V3`` 0x200 Indicates if the ``sramecc`` 1180 target feature is 1181 enabled for all code 1182 contained in the code object. 1183 If the processor 1184 does not support the 1185 ``sramecc`` target 1186 feature then must 1187 be 0. 1188 See 1189 :ref:`amdgpu-target-features`. 1190 ================================= ===== ============================= 1191 1192 .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V4 and After 1193 :name: amdgpu-elf-header-e_flags-table-v4-onwards 1194 1195 ============================================ ===== =================================== 1196 Name Value Description 1197 ============================================ ===== =================================== 1198 ``EF_AMDGPU_MACH`` 0x0ff AMDGPU processor selection 1199 mask for 1200 ``EF_AMDGPU_MACH_xxx`` values 1201 defined in 1202 :ref:`amdgpu-ef-amdgpu-mach-table`. 1203 ``EF_AMDGPU_FEATURE_XNACK_V4`` 0x300 XNACK selection mask for 1204 ``EF_AMDGPU_FEATURE_XNACK_*_V4`` 1205 values. 1206 ``EF_AMDGPU_FEATURE_XNACK_UNSUPPORTED_V4`` 0x000 XNACK unsupported. 1207 ``EF_AMDGPU_FEATURE_XNACK_ANY_V4`` 0x100 XNACK can have any value. 1208 ``EF_AMDGPU_FEATURE_XNACK_OFF_V4`` 0x200 XNACK disabled. 
1209 ``EF_AMDGPU_FEATURE_XNACK_ON_V4`` 0x300 XNACK enabled. 1210 ``EF_AMDGPU_FEATURE_SRAMECC_V4`` 0xc00 SRAMECC selection mask for 1211 ``EF_AMDGPU_FEATURE_SRAMECC_*_V4`` 1212 values. 1213 ``EF_AMDGPU_FEATURE_SRAMECC_UNSUPPORTED_V4`` 0x000 SRAMECC unsupported. 1214 ``EF_AMDGPU_FEATURE_SRAMECC_ANY_V4`` 0x400 SRAMECC can have any value. 1215 ``EF_AMDGPU_FEATURE_SRAMECC_OFF_V4`` 0x800 SRAMECC disabled. 1216 ``EF_AMDGPU_FEATURE_SRAMECC_ON_V4`` 0xc00 SRAMECC enabled. 1217 ============================================ ===== =================================== 1218 1219 .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values 1220 :name: amdgpu-ef-amdgpu-mach-table 1221 1222 ==================================== ========== ============================= 1223 Name Value Description (see 1224 :ref:`amdgpu-processor-table`) 1225 ==================================== ========== ============================= 1226 ``EF_AMDGPU_MACH_NONE`` 0x000 *not specified* 1227 ``EF_AMDGPU_MACH_R600_R600`` 0x001 ``r600`` 1228 ``EF_AMDGPU_MACH_R600_R630`` 0x002 ``r630`` 1229 ``EF_AMDGPU_MACH_R600_RS880`` 0x003 ``rs880`` 1230 ``EF_AMDGPU_MACH_R600_RV670`` 0x004 ``rv670`` 1231 ``EF_AMDGPU_MACH_R600_RV710`` 0x005 ``rv710`` 1232 ``EF_AMDGPU_MACH_R600_RV730`` 0x006 ``rv730`` 1233 ``EF_AMDGPU_MACH_R600_RV770`` 0x007 ``rv770`` 1234 ``EF_AMDGPU_MACH_R600_CEDAR`` 0x008 ``cedar`` 1235 ``EF_AMDGPU_MACH_R600_CYPRESS`` 0x009 ``cypress`` 1236 ``EF_AMDGPU_MACH_R600_JUNIPER`` 0x00a ``juniper`` 1237 ``EF_AMDGPU_MACH_R600_REDWOOD`` 0x00b ``redwood`` 1238 ``EF_AMDGPU_MACH_R600_SUMO`` 0x00c ``sumo`` 1239 ``EF_AMDGPU_MACH_R600_BARTS`` 0x00d ``barts`` 1240 ``EF_AMDGPU_MACH_R600_CAICOS`` 0x00e ``caicos`` 1241 ``EF_AMDGPU_MACH_R600_CAYMAN`` 0x00f ``cayman`` 1242 ``EF_AMDGPU_MACH_R600_TURKS`` 0x010 ``turks`` 1243 *reserved* 0x011 - Reserved for ``r600`` 1244 architecture processors. 
1245 ``EF_AMDGPU_MACH_AMDGCN_GFX600`` 0x020 ``gfx600`` 1246 ``EF_AMDGPU_MACH_AMDGCN_GFX601`` 0x021 ``gfx601`` 1247 ``EF_AMDGPU_MACH_AMDGCN_GFX700`` 0x022 ``gfx700`` 1248 ``EF_AMDGPU_MACH_AMDGCN_GFX701`` 0x023 ``gfx701`` 1249 ``EF_AMDGPU_MACH_AMDGCN_GFX702`` 0x024 ``gfx702`` 1250 ``EF_AMDGPU_MACH_AMDGCN_GFX703`` 0x025 ``gfx703`` 1251 ``EF_AMDGPU_MACH_AMDGCN_GFX704`` 0x026 ``gfx704`` 1252 *reserved* 0x027 Reserved. 1253 ``EF_AMDGPU_MACH_AMDGCN_GFX801`` 0x028 ``gfx801`` 1254 ``EF_AMDGPU_MACH_AMDGCN_GFX802`` 0x029 ``gfx802`` 1255 ``EF_AMDGPU_MACH_AMDGCN_GFX803`` 0x02a ``gfx803`` 1256 ``EF_AMDGPU_MACH_AMDGCN_GFX810`` 0x02b ``gfx810`` 1257 ``EF_AMDGPU_MACH_AMDGCN_GFX900`` 0x02c ``gfx900`` 1258 ``EF_AMDGPU_MACH_AMDGCN_GFX902`` 0x02d ``gfx902`` 1259 ``EF_AMDGPU_MACH_AMDGCN_GFX904`` 0x02e ``gfx904`` 1260 ``EF_AMDGPU_MACH_AMDGCN_GFX906`` 0x02f ``gfx906`` 1261 ``EF_AMDGPU_MACH_AMDGCN_GFX908`` 0x030 ``gfx908`` 1262 ``EF_AMDGPU_MACH_AMDGCN_GFX909`` 0x031 ``gfx909`` 1263 ``EF_AMDGPU_MACH_AMDGCN_GFX90C`` 0x032 ``gfx90c`` 1264 ``EF_AMDGPU_MACH_AMDGCN_GFX1010`` 0x033 ``gfx1010`` 1265 ``EF_AMDGPU_MACH_AMDGCN_GFX1011`` 0x034 ``gfx1011`` 1266 ``EF_AMDGPU_MACH_AMDGCN_GFX1012`` 0x035 ``gfx1012`` 1267 ``EF_AMDGPU_MACH_AMDGCN_GFX1030`` 0x036 ``gfx1030`` 1268 ``EF_AMDGPU_MACH_AMDGCN_GFX1031`` 0x037 ``gfx1031`` 1269 ``EF_AMDGPU_MACH_AMDGCN_GFX1032`` 0x038 ``gfx1032`` 1270 ``EF_AMDGPU_MACH_AMDGCN_GFX1033`` 0x039 ``gfx1033`` 1271 ``EF_AMDGPU_MACH_AMDGCN_GFX602`` 0x03a ``gfx602`` 1272 ``EF_AMDGPU_MACH_AMDGCN_GFX705`` 0x03b ``gfx705`` 1273 ``EF_AMDGPU_MACH_AMDGCN_GFX805`` 0x03c ``gfx805`` 1274 ``EF_AMDGPU_MACH_AMDGCN_GFX1035`` 0x03d ``gfx1035`` 1275 ``EF_AMDGPU_MACH_AMDGCN_GFX1034`` 0x03e ``gfx1034`` 1276 ``EF_AMDGPU_MACH_AMDGCN_GFX90A`` 0x03f ``gfx90a`` 1277 ``EF_AMDGPU_MACH_AMDGCN_GFX940`` 0x040 ``gfx940`` 1278 ``EF_AMDGPU_MACH_AMDGCN_GFX1100`` 0x041 ``gfx1100`` 1279 ``EF_AMDGPU_MACH_AMDGCN_GFX1013`` 0x042 ``gfx1013`` 1280 *reserved* 0x043 Reserved. 
1281 ``EF_AMDGPU_MACH_AMDGCN_GFX1103`` 0x044 ``gfx1103`` 1282 ``EF_AMDGPU_MACH_AMDGCN_GFX1036`` 0x045 ``gfx1036`` 1283 ``EF_AMDGPU_MACH_AMDGCN_GFX1101`` 0x046 ``gfx1101`` 1284 ``EF_AMDGPU_MACH_AMDGCN_GFX1102`` 0x047 ``gfx1102`` 1285 ==================================== ========== ============================= 1286 1287Sections 1288-------- 1289 1290An AMDGPU target ELF code object has the standard ELF sections which include: 1291 1292 .. table:: AMDGPU ELF Sections 1293 :name: amdgpu-elf-sections-table 1294 1295 ================== ================ ================================= 1296 Name Type Attributes 1297 ================== ================ ================================= 1298 ``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE`` 1299 ``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE`` 1300 ``.debug_``\ *\** ``SHT_PROGBITS`` *none* 1301 ``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC`` 1302 ``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC`` 1303 ``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC`` 1304 ``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE`` 1305 ``.hash`` ``SHT_HASH`` ``SHF_ALLOC`` 1306 ``.note`` ``SHT_NOTE`` *none* 1307 ``.rela``\ *name* ``SHT_RELA`` *none* 1308 ``.rela.dyn`` ``SHT_RELA`` *none* 1309 ``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC`` 1310 ``.shstrtab`` ``SHT_STRTAB`` *none* 1311 ``.strtab`` ``SHT_STRTAB`` *none* 1312 ``.symtab`` ``SHT_SYMTAB`` *none* 1313 ``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR`` 1314 ================== ================ ================================= 1315 1316These sections have their standard meanings (see [ELF]_) and are only generated 1317if needed. 1318 1319``.debug``\ *\** 1320 The standard DWARF sections. See :ref:`amdgpu-dwarf-debug-information` for 1321 information on the DWARF produced by the AMDGPU backend. 1322 1323``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash`` 1324 The standard sections used by a dynamic loader. 
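As an illustration of the type and attribute combinations in the sections table
above, a section header can be classified as follows (the constants are the
standard ELF values from [ELF]_; the helper names are illustrative, not an
AMDGPU API):

.. code:: c

   #include <stdint.h>

   /* Standard ELF section header constants (see [ELF]_). */
   #define SHT_PROGBITS  1
   #define SHT_NOBITS    8
   #define SHF_WRITE     0x1
   #define SHF_ALLOC     0x2
   #define SHF_EXECINSTR 0x4

   /* Matches the .text row: SHT_PROGBITS with SHF_ALLOC + SHF_EXECINSTR. */
   static int is_text_like(uint32_t sh_type, uint64_t sh_flags) {
     return sh_type == SHT_PROGBITS &&
            (sh_flags & (SHF_ALLOC | SHF_EXECINSTR)) ==
                (SHF_ALLOC | SHF_EXECINSTR);
   }

   /* Matches the .bss row: SHT_NOBITS with SHF_ALLOC + SHF_WRITE. */
   static int is_bss_like(uint32_t sh_type, uint64_t sh_flags) {
     return sh_type == SHT_NOBITS &&
            (sh_flags & (SHF_ALLOC | SHF_WRITE)) == (SHF_ALLOC | SHF_WRITE);
   }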
1325 1326 ``.note`` 1327 See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU 1328 backend. 1329 1330 ``.rela``\ *name*, ``.rela.dyn`` 1331 For relocatable code objects, *name* is the name of the section to which the 1332 relocation records apply. For example, ``.rela.text`` is the section name for 1333 relocation records associated with the ``.text`` section. 1334 1335 For linked shared code objects, ``.rela.dyn`` contains all the relocation 1336 records from each of the relocatable code object's ``.rela``\ *name* sections. 1337 1338 See :ref:`amdgpu-relocation-records` for the relocation records supported by 1339 the AMDGPU backend. 1340 1341 ``.text`` 1342 The executable machine code for the kernels and functions they call. Generated 1343 as position independent code. See :ref:`amdgpu-code-conventions` for 1344 information on conventions used in the ISA generation. 1345 1346 .. _amdgpu-note-records: 1347 1348 Note Records 1349 ------------ 1350 1351 The AMDGPU backend code object contains ELF note records in the ``.note`` 1352 section. The set of generated notes and their semantics depend on the code 1353 object version; see :ref:`amdgpu-note-records-v2` and 1354 :ref:`amdgpu-note-records-v3-onwards`. 1355 1356 As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero-byte padding 1357 must be generated after the ``name`` field to ensure the ``desc`` field is 4 1358 byte aligned. In addition, minimal zero-byte padding must be generated to 1359 ensure the ``desc`` field size is a multiple of 4 bytes. The ``sh_addralign`` 1360 field of the ``.note`` section must be at least 4 to indicate at least 8 byte 1361 alignment. 1362 1363 .. _amdgpu-note-records-v2: 1364 1365 Code Object V2 Note Records 1366 ~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1367 1368 .. warning:: 1369 Code object V2 is not the default code object version emitted by 1370 this version of LLVM. 
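The note record padding rules described above can be sketched as follows (the
helper names are illustrative, not part of any AMDGPU API):

.. code:: c

   #include <stdint.h>

   /* Round up to the next multiple of 4, as required for the name and
      desc fields of an ELF note record. */
   static uint32_t note_align4(uint32_t x) { return (x + 3u) & ~3u; }

   /* Total size in bytes of one note record: the three 4-byte header
      words (namesz, descsz, type) followed by the zero-padded name and
      the zero-padded desc. */
   static uint32_t note_record_size(uint32_t namesz, uint32_t descsz) {
     return 12u + note_align4(namesz) + note_align4(descsz);
   }

For example, an "AMD" vendor name (``namesz`` 4, including the NUL) with an
8 byte desc occupies 24 bytes, while an "AMDGPU" name (``namesz`` 7) with a
10 byte desc is padded to 12 + 8 + 12 = 32 bytes.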
1371 1372The AMDGPU backend code object uses the following ELF note record in the 1373``.note`` section when compiling for code object V2. 1374 1375The note record vendor field is "AMD". 1376 1377Additional note records may be present, but any which are not documented here 1378are deprecated and should not be used. 1379 1380 .. table:: AMDGPU Code Object V2 ELF Note Records 1381 :name: amdgpu-elf-note-records-v2-table 1382 1383 ===== ===================================== ====================================== 1384 Name Type Description 1385 ===== ===================================== ====================================== 1386 "AMD" ``NT_AMD_HSA_CODE_OBJECT_VERSION`` Code object version. 1387 "AMD" ``NT_AMD_HSA_HSAIL`` HSAIL properties generated by the HSAIL 1388 Finalizer and not the LLVM compiler. 1389 "AMD" ``NT_AMD_HSA_ISA_VERSION`` Target ISA version. 1390 "AMD" ``NT_AMD_HSA_METADATA`` Metadata null terminated string in 1391 YAML [YAML]_ textual format. 1392 "AMD" ``NT_AMD_HSA_ISA_NAME`` Target ISA name. 1393 ===== ===================================== ====================================== 1394 1395.. 1396 1397 .. table:: AMDGPU Code Object V2 ELF Note Record Enumeration Values 1398 :name: amdgpu-elf-note-record-enumeration-values-v2-table 1399 1400 ===================================== ===== 1401 Name Value 1402 ===================================== ===== 1403 ``NT_AMD_HSA_CODE_OBJECT_VERSION`` 1 1404 ``NT_AMD_HSA_HSAIL`` 2 1405 ``NT_AMD_HSA_ISA_VERSION`` 3 1406 *reserved* 4-9 1407 ``NT_AMD_HSA_METADATA`` 10 1408 ``NT_AMD_HSA_ISA_NAME`` 11 1409 ===================================== ===== 1410 1411``NT_AMD_HSA_CODE_OBJECT_VERSION`` 1412 Specifies the code object version number. The description field has the 1413 following layout: 1414 1415 .. code:: c 1416 1417 struct amdgpu_hsa_note_code_object_version_s { 1418 uint32_t major_version; 1419 uint32_t minor_version; 1420 }; 1421 1422 The ``major_version`` has a value less than or equal to 2. 
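  A minimal sketch of reading this desc field (assuming a little-endian host,
  matching ``ELFDATA2LSB``; the helper name is illustrative, not part of any
  AMDGPU API):

  .. code:: c

    #include <stdint.h>
    #include <string.h>

    struct amdgpu_hsa_note_code_object_version_s {
      uint32_t major_version;
      uint32_t minor_version;
    };

    /* Copies the desc bytes into the struct (memcpy avoids assuming the
       note buffer is 4 byte aligned) and checks the documented
       constraint that major_version is at most 2 for code object V2. */
    static int read_code_object_version(
        const uint8_t *desc, uint32_t descsz,
        struct amdgpu_hsa_note_code_object_version_s *out) {
      if (descsz < sizeof(*out))
        return 0;
      memcpy(out, desc, sizeof(*out));
      return out->major_version <= 2;
    }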
1423 1424 ``NT_AMD_HSA_HSAIL`` 1425 Specifies the HSAIL properties used by the HSAIL Finalizer. The description 1426 field has the following layout: 1427 1428 .. code:: c 1429 1430 struct amdgpu_hsa_note_hsail_s { 1431 uint32_t hsail_major_version; 1432 uint32_t hsail_minor_version; 1433 uint8_t profile; 1434 uint8_t machine_model; 1435 uint8_t default_float_round; 1436 }; 1437 1438 ``NT_AMD_HSA_ISA_VERSION`` 1439 Specifies the target ISA version. The description field has the following layout: 1440 1441 .. code:: c 1442 1443 struct amdgpu_hsa_note_isa_s { 1444 uint16_t vendor_name_size; 1445 uint16_t architecture_name_size; 1446 uint32_t major; 1447 uint32_t minor; 1448 uint32_t stepping; 1449 char vendor_and_architecture_name[1]; 1450 }; 1451 1452 ``vendor_name_size`` and ``architecture_name_size`` are the length of the 1453 vendor and architecture names respectively, including the NUL character. 1454 1455 ``vendor_and_architecture_name`` contains the NUL terminated string for the 1456 vendor, immediately followed by the NUL terminated string for the 1457 architecture. 1458 1459 This note record is used by the HSA runtime loader. 1460 1461 Code object V2 only supports a limited number of processors and has fixed 1462 settings for target features. See 1463 :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a list of 1464 processors and the corresponding target ID. In the table the note record ISA 1465 name is a concatenation of the vendor name, architecture name, major, minor, 1466 and stepping separated by a ":". 1467 1468 The target ID column shows the processor name and fixed target features used 1469 by the LLVM compiler. The LLVM compiler does not generate a 1470 ``NT_AMD_HSA_HSAIL`` note record. 1471 1472 A code object generated by the Finalizer also uses code object V2 and always 1473 generates a ``NT_AMD_HSA_HSAIL`` note record. 
The processor name and 1474 ``sramecc`` target feature are as shown in 1475 :ref:`amdgpu-elf-note-record-supported_processors-v2-table` but the ``xnack`` 1476 target feature is specified by the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags`` 1477 bit. 1478 1479 ``NT_AMD_HSA_ISA_NAME`` 1480 Specifies the target ISA name as a non-NUL terminated string. 1481 1482 This note record is not used by the HSA runtime loader. 1483 1484 See the ``NT_AMD_HSA_ISA_VERSION`` note record description of the code object 1485 V2's limited support of processors and fixed settings for target features. 1486 1487 See :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a mapping 1488 from the string to the corresponding target ID. If the ``xnack`` target 1489 feature is supported and enabled, the string produced by the LLVM compiler 1490 may have a ``+xnack`` appended. The Finalizer did not do the appending and 1491 instead used the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags`` bit. 1492 1493 ``NT_AMD_HSA_METADATA`` 1494 Specifies extensible metadata associated with the code objects executed on HSA 1495 [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). It is required when the 1496 target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See 1497 :ref:`amdgpu-amdhsa-code-object-metadata-v2` for the syntax of the code object 1498 metadata string. 1499 1500 .. 
table:: AMDGPU Code Object V2 Supported Processors and Fixed Target Feature Settings 1501 :name: amdgpu-elf-note-record-supported_processors-v2-table 1502 1503 ===================== ========================== 1504 Note Record ISA Name Target ID 1505 ===================== ========================== 1506 ``AMD:AMDGPU:6:0:0`` ``gfx600`` 1507 ``AMD:AMDGPU:6:0:1`` ``gfx601`` 1508 ``AMD:AMDGPU:6:0:2`` ``gfx602`` 1509 ``AMD:AMDGPU:7:0:0`` ``gfx700`` 1510 ``AMD:AMDGPU:7:0:1`` ``gfx701`` 1511 ``AMD:AMDGPU:7:0:2`` ``gfx702`` 1512 ``AMD:AMDGPU:7:0:3`` ``gfx703`` 1513 ``AMD:AMDGPU:7:0:4`` ``gfx704`` 1514 ``AMD:AMDGPU:7:0:5`` ``gfx705`` 1515 ``AMD:AMDGPU:8:0:0`` ``gfx802`` 1516 ``AMD:AMDGPU:8:0:1`` ``gfx801:xnack+`` 1517 ``AMD:AMDGPU:8:0:2`` ``gfx802`` 1518 ``AMD:AMDGPU:8:0:3`` ``gfx803`` 1519 ``AMD:AMDGPU:8:0:4`` ``gfx803`` 1520 ``AMD:AMDGPU:8:0:5`` ``gfx805`` 1521 ``AMD:AMDGPU:8:1:0`` ``gfx810:xnack+`` 1522 ``AMD:AMDGPU:9:0:0`` ``gfx900:xnack-`` 1523 ``AMD:AMDGPU:9:0:1`` ``gfx900:xnack+`` 1524 ``AMD:AMDGPU:9:0:2`` ``gfx902:xnack-`` 1525 ``AMD:AMDGPU:9:0:3`` ``gfx902:xnack+`` 1526 ``AMD:AMDGPU:9:0:4`` ``gfx904:xnack-`` 1527 ``AMD:AMDGPU:9:0:5`` ``gfx904:xnack+`` 1528 ``AMD:AMDGPU:9:0:6`` ``gfx906:sramecc-:xnack-`` 1529 ``AMD:AMDGPU:9:0:7`` ``gfx906:sramecc-:xnack+`` 1530 ``AMD:AMDGPU:9:0:12`` ``gfx90c:xnack-`` 1531 ===================== ========================== 1532 1533.. _amdgpu-note-records-v3-onwards: 1534 1535Code Object V3 and Above Note Records 1536~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1537 1538The AMDGPU backend code object uses the following ELF note record in the 1539``.note`` section when compiling for code object V3 and above. 1540 1541The note record vendor field is "AMDGPU". 1542 1543Additional note records may be present, but any which are not documented here 1544are deprecated and should not be used. 1545 1546 .. 
table:: AMDGPU Code Object V3 and Above ELF Note Records 1547 :name: amdgpu-elf-note-records-table-v3-onwards 1548 1549 ======== ============================== ====================================== 1550 Name Type Description 1551 ======== ============================== ====================================== 1552 "AMDGPU" ``NT_AMDGPU_METADATA`` Metadata in Message Pack [MsgPack]_ 1553 binary format. 1554 ======== ============================== ====================================== 1555 1556.. 1557 1558 .. table:: AMDGPU Code Object V3 and Above ELF Note Record Enumeration Values 1559 :name: amdgpu-elf-note-record-enumeration-values-table-v3-onwards 1560 1561 ============================== ===== 1562 Name Value 1563 ============================== ===== 1564 *reserved* 0-31 1565 ``NT_AMDGPU_METADATA`` 32 1566 ============================== ===== 1567 1568``NT_AMDGPU_METADATA`` 1569 Specifies extensible metadata associated with an AMDGPU code object. It is 1570 encoded as a map in the Message Pack [MsgPack]_ binary data format. See 1571 :ref:`amdgpu-amdhsa-code-object-metadata-v3`, 1572 :ref:`amdgpu-amdhsa-code-object-metadata-v4` and 1573 :ref:`amdgpu-amdhsa-code-object-metadata-v5` for the map keys defined for the 1574 ``amdhsa`` OS. 1575 1576.. _amdgpu-symbols: 1577 1578Symbols 1579------- 1580 1581Symbols include the following: 1582 1583 .. 
table:: AMDGPU ELF Symbols 1584 :name: amdgpu-elf-symbols-table 1585 1586 ===================== ================== ================ ================== 1587 Name Type Section Description 1588 ===================== ================== ================ ================== 1589 *link-name* ``STT_OBJECT`` - ``.data`` Global variable 1590 - ``.rodata`` 1591 - ``.bss`` 1592 *link-name*\ ``.kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor 1593 *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point 1594 *link-name* ``STT_OBJECT`` - SHN_AMDGPU_LDS Global variable in LDS 1595 ===================== ================== ================ ================== 1596 1597 Global variable 1598 Global variables both used and defined by the compilation unit. 1599 1600 If the symbol is defined in the compilation unit then it is allocated in the 1601 appropriate section according to whether it has initialized data or is read-only. 1602 1603 If the symbol is external then its section is ``STN_UNDEF`` and the loader 1604 will resolve relocations using the definition provided by another code object 1605 or explicitly defined by the runtime. 1606 1607 If the symbol resides in local/group memory (LDS) then its section is the 1608 special processor specific section name ``SHN_AMDGPU_LDS``, and the 1609 ``st_value`` field describes alignment requirements as it does for common 1610 symbols. 1611 1612 .. TODO:: 1613 1614 Add description of linked shared object symbols. Seems undefined symbols 1615 are marked as STT_NOTYPE. 1616 1617 Kernel descriptor 1618 Every HSA kernel has an associated kernel descriptor. It is the address of the 1619 kernel descriptor that is used in the AQL dispatch packet used to invoke the 1620 kernel, not the kernel entry point. The layout of the HSA kernel descriptor is 1621 defined in :ref:`amdgpu-amdhsa-kernel-descriptor`. 1622 1623 Kernel entry point 1624 Every HSA kernel also has a symbol for its machine code entry point. 1625 1626 .. 
_amdgpu-relocation-records: 1627 1628Relocation Records 1629------------------ 1630 1631AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported 1632relocatable fields are: 1633 1634``word32`` 1635 This specifies a 32-bit field occupying 4 bytes with arbitrary byte 1636 alignment. These values use the same byte order as other word values in the 1637 AMDGPU architecture. 1638 1639``word64`` 1640 This specifies a 64-bit field occupying 8 bytes with arbitrary byte 1641 alignment. These values use the same byte order as other word values in the 1642 AMDGPU architecture. 1643 1644Following notations are used for specifying relocation calculations: 1645 1646**A** 1647 Represents the addend used to compute the value of the relocatable field. 1648 1649**G** 1650 Represents the offset into the global offset table at which the relocation 1651 entry's symbol will reside during execution. 1652 1653**GOT** 1654 Represents the address of the global offset table. 1655 1656**P** 1657 Represents the place (section offset for ``et_rel`` or address for ``et_dyn``) 1658 of the storage unit being relocated (computed using ``r_offset``). 1659 1660**S** 1661 Represents the value of the symbol whose index resides in the relocation 1662 entry. Relocations not using this must specify a symbol index of 1663 ``STN_UNDEF``. 1664 1665**B** 1666 Represents the base address of a loaded executable or shared object which is 1667 the difference between the ELF address and the actual load address. 1668 Relocations using this are only valid in executable or shared objects. 1669 1670The following relocation types are supported: 1671 1672 .. 
table:: AMDGPU ELF Relocation Records 1673 :name: amdgpu-elf-relocation-records-table 1674 1675 ========================== ======= ===== ========== ============================== 1676 Relocation Type Kind Value Field Calculation 1677 ========================== ======= ===== ========== ============================== 1678 ``R_AMDGPU_NONE`` 0 *none* *none* 1679 ``R_AMDGPU_ABS32_LO`` Static, 1 ``word32`` (S + A) & 0xFFFFFFFF 1680 Dynamic 1681 ``R_AMDGPU_ABS32_HI`` Static, 2 ``word32`` (S + A) >> 32 1682 Dynamic 1683 ``R_AMDGPU_ABS64`` Static, 3 ``word64`` S + A 1684 Dynamic 1685 ``R_AMDGPU_REL32`` Static 4 ``word32`` S + A - P 1686 ``R_AMDGPU_REL64`` Static 5 ``word64`` S + A - P 1687 ``R_AMDGPU_ABS32`` Static, 6 ``word32`` S + A 1688 Dynamic 1689 ``R_AMDGPU_GOTPCREL`` Static 7 ``word32`` G + GOT + A - P 1690 ``R_AMDGPU_GOTPCREL32_LO`` Static 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF 1691 ``R_AMDGPU_GOTPCREL32_HI`` Static 9 ``word32`` (G + GOT + A - P) >> 32 1692 ``R_AMDGPU_REL32_LO`` Static 10 ``word32`` (S + A - P) & 0xFFFFFFFF 1693 ``R_AMDGPU_REL32_HI`` Static 11 ``word32`` (S + A - P) >> 32 1694 *reserved* 12 1695 ``R_AMDGPU_RELATIVE64`` Dynamic 13 ``word64`` B + A 1696 ``R_AMDGPU_REL16`` Static 14 ``word16`` ((S + A - P) - 4) / 4 1697 ========================== ======= ===== ========== ============================== 1698 1699``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by 1700the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``. 1701 1702There is no current OS loader support for 32-bit programs and so 1703``R_AMDGPU_ABS32`` is not used. 1704 1705.. _amdgpu-loaded-code-object-path-uniform-resource-identifier: 1706 1707Loaded Code Object Path Uniform Resource Identifier (URI) 1708--------------------------------------------------------- 1709 1710The AMD GPU code object loader represents the path of the ELF shared object from 1711which the code object was loaded as a textual Uniform Resource Identifier (URI). 
Note that the code object is the in memory loaded relocated form of the ELF
shared object. Multiple code objects may be loaded at different memory
addresses in the same process from the same ELF shared object.

The loaded code object path URI syntax is defined by the following BNF syntax:

.. code::

  code_object_uri ::== file_uri | memory_uri
  file_uri        ::== "file://" file_path [ range_specifier ]
  memory_uri      ::== "memory://" process_id range_specifier
  range_specifier ::== [ "#" | "?" ] "offset=" number "&" "size=" number
  file_path       ::== URI_ENCODED_OS_FILE_PATH
  process_id      ::== DECIMAL_NUMBER
  number          ::== HEX_NUMBER | DECIMAL_NUMBER | OCTAL_NUMBER

**number**
  Is a C integral literal where hexadecimal values are prefixed by "0x" or
  "0X", and octal values by "0".

**file_path**
  Is the file's path specified as a URI encoded UTF-8 string. In URI encoding,
  every character that is not in the regular expression ``[a-zA-Z0-9/_.~-]``
  is encoded as two uppercase hexadecimal digits preceded by "%". Directories
  in the path are separated by "/".

**offset**
  Is a 0-based byte offset to the start of the code object. For a file URI, it
  is from the start of the file specified by the ``file_path``, and if omitted
  defaults to 0. For a memory URI, it is the memory address and is required.

**size**
  Is the number of bytes in the code object. For a file URI, if omitted it
  defaults to the size of the file. It is required for a memory URI.

**process_id**
  Is the identity of the process owning the memory. For Linux it is the C
  unsigned integral decimal literal for the process ID (PID).

For example:

.. code::

  file:///dir1/dir2/file1
  file:///dir3/dir4/file2#offset=0x2000&size=3000
  memory://1234#offset=0x20000&size=3000

.. _amdgpu-dwarf-debug-information:

DWARF Debug Information
=======================

.. warning::

   This section describes **provisional support** for AMDGPU DWARF [DWARF]_
   that is not currently fully implemented and is subject to change.

AMDGPU generates DWARF [DWARF]_ debugging information ELF sections (see
:ref:`amdgpu-elf-code-object`) which contain information that maps the code
object executable code and data to the source language constructs. It can be
used by tools such as debuggers and profilers. It uses features defined in
:doc:`AMDGPUDwarfExtensionsForHeterogeneousDebugging` that are made available
in DWARF Version 4 and DWARF Version 5 as an LLVM vendor extension.

This section defines the AMDGPU target architecture specific DWARF mappings.

.. _amdgpu-dwarf-register-identifier:

Register Identifier
-------------------

This section defines the AMDGPU target architecture register numbers used in
DWARF operation expressions (see DWARF Version 5 section 2.5 and
:ref:`amdgpu-dwarf-operation-expressions`) and Call Frame Information
instructions (see DWARF Version 5 section 6.4 and
:ref:`amdgpu-dwarf-call-frame-information`).

A single code object can contain code for kernels that have different
wavefront sizes. The vector registers and some scalar registers are based on
the wavefront size. AMDGPU defines distinct DWARF registers for each wavefront
size. This simplifies the consumer of the DWARF so that each register has a
fixed size, rather than being dynamic according to the wavefront size mode.
Similarly, distinct DWARF registers are defined for those registers that vary
in size according to the process address size. This allows a consumer to treat
a specific AMDGPU processor as a single architecture regardless of how it is
configured at run time.
The compiler explicitly specifies the DWARF registers that match the mode in
which the code it is generating will be executed.

DWARF registers are encoded as numbers, which are mapped to architecture
registers. The mapping for AMDGPU is defined in
:ref:`amdgpu-dwarf-register-mapping-table`. All AMDGPU targets use the same
mapping.

.. table:: AMDGPU DWARF Register Mapping
   :name: amdgpu-dwarf-register-mapping-table

   ============== ================= ======== ==================================
   DWARF Register AMDGPU Register   Bit Size Description
   ============== ================= ======== ==================================
   0              PC_32             32       Program Counter (PC) when
                                             executing in a 32-bit process
                                             address space. Used in the CFI to
                                             describe the PC of the calling
                                             frame.
   1              EXEC_MASK_32      32       Execution Mask Register when
                                             executing in wavefront 32 mode.
   2-15           *Reserved*                 *Reserved for highly accessed
                                             registers using DWARF shortcut.*
   16             PC_64             64       Program Counter (PC) when
                                             executing in a 64-bit process
                                             address space. Used in the CFI to
                                             describe the PC of the calling
                                             frame.
   17             EXEC_MASK_64      64       Execution Mask Register when
                                             executing in wavefront 64 mode.
   18-31          *Reserved*                 *Reserved for highly accessed
                                             registers using DWARF shortcut.*
   32-95          SGPR0-SGPR63      32       Scalar General Purpose
                                             Registers.
   96-127         *Reserved*                 *Reserved for frequently accessed
                                             registers using DWARF 1-byte
                                             ULEB.*
   128            STATUS            32       Status Register.
   129-511        *Reserved*                 *Reserved for future Scalar
                                             Architectural Registers.*
   512            VCC_32            32       Vector Condition Code Register
                                             when executing in wavefront 32
                                             mode.
   513-767        *Reserved*                 *Reserved for future Vector
                                             Architectural Registers when
                                             executing in wavefront 32 mode.*
   768            VCC_64            64       Vector Condition Code Register
                                             when executing in wavefront 64
                                             mode.
   769-1023       *Reserved*                 *Reserved for future Vector
                                             Architectural Registers when
                                             executing in wavefront 64 mode.*
   1024-1087      *Reserved*                 *Reserved for padding.*
   1088-1129      SGPR64-SGPR105    32       Scalar General Purpose Registers.
   1130-1535      *Reserved*                 *Reserved for future Scalar
                                             General Purpose Registers.*
   1536-1791      VGPR0-VGPR255     32*32    Vector General Purpose Registers
                                             when executing in wavefront 32
                                             mode.
   1792-2047      *Reserved*                 *Reserved for future Vector
                                             General Purpose Registers when
                                             executing in wavefront 32 mode.*
   2048-2303      AGPR0-AGPR255     32*32    Vector Accumulation Registers
                                             when executing in wavefront 32
                                             mode.
   2304-2559      *Reserved*                 *Reserved for future Vector
                                             Accumulation Registers when
                                             executing in wavefront 32 mode.*
   2560-2815      VGPR0-VGPR255     64*32    Vector General Purpose Registers
                                             when executing in wavefront 64
                                             mode.
   2816-3071      *Reserved*                 *Reserved for future Vector
                                             General Purpose Registers when
                                             executing in wavefront 64 mode.*
   3072-3327      AGPR0-AGPR255     64*32    Vector Accumulation Registers
                                             when executing in wavefront 64
                                             mode.
   3328-3583      *Reserved*                 *Reserved for future Vector
                                             Accumulation Registers when
                                             executing in wavefront 64 mode.*
   ============== ================= ======== ==================================

The vector registers are represented as the full size for the wavefront. They
are organized as consecutive dwords (32-bits), one per lane, with the dword at
the least significant bit position corresponding to lane 0 and so forth. DWARF
location expressions involving the ``DW_OP_LLVM_offset`` and
``DW_OP_LLVM_push_lane`` operations are used to select the part of the vector
register corresponding to the lane that is executing the current thread of
execution in languages that are implemented using a SIMD or SIMT execution
model.

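
This per-lane layout can be illustrated with a minimal sketch (hypothetical
helper, not part of any AMDGPU tool): the dword belonging to a given lane sits
at a fixed bit offset within the wavefront-sized DWARF vector register.

```python
# Sketch: locate the 32-bit dword belonging to one lane inside the full
# wavefront-sized DWARF vector register (lane 0 at the least significant end).
def lane_dword_bit_offset(lane_id: int, wavefront_size: int = 64) -> int:
    if not 0 <= lane_id < wavefront_size:
        raise ValueError("lane_id out of range for this wavefront size")
    return lane_id * 32  # each lane contributes one 32-bit dword

print(lane_dword_bit_offset(0))   # 0
print(lane_dword_bit_offset(10))  # 320
```

A ``DW_OP_LLVM_offset`` of ``lane * 4`` bytes selects the same dword when the
offset is expressed in bytes rather than bits.
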

If the wavefront size is 32 lanes then the wavefront 32 mode register
definitions are used. If the wavefront size is 64 lanes then the wavefront 64
mode register definitions are used. Some AMDGPU targets support executing in
both wavefront 32 and wavefront 64 mode. The register definitions
corresponding to the wavefront mode of the generated code will be used.

If code is generated to execute in a 32-bit process address space, then the
32-bit process address space register definitions are used. If code is
generated to execute in a 64-bit process address space, then the 64-bit
process address space register definitions are used. The ``amdgcn`` target
only supports the 64-bit process address space.

.. _amdgpu-dwarf-address-class-identifier:

Address Class Identifier
------------------------

The DWARF address class represents the source language memory space. See
DWARF Version 5 section 2.12 which is updated by the *DWARF Extensions For
Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses`.

The DWARF address class mapping used for AMDGPU is defined in
:ref:`amdgpu-dwarf-address-class-mapping-table`.

.. table:: AMDGPU DWARF Address Class Mapping
   :name: amdgpu-dwarf-address-class-mapping-table

   ========================= ====== =================
   DWARF                            AMDGPU
   -------------------------------- -----------------
   Address Class Name        Value  Address Space
   ========================= ====== =================
   ``DW_ADDR_none``          0x0000 Generic (Flat)
   ``DW_ADDR_LLVM_global``   0x0001 Global
   ``DW_ADDR_LLVM_constant`` 0x0002 Global
   ``DW_ADDR_LLVM_group``    0x0003 Local (group/LDS)
   ``DW_ADDR_LLVM_private``  0x0004 Private (Scratch)
   ``DW_ADDR_AMDGPU_region`` 0x8000 Region (GDS)
   ========================= ====== =================

The DWARF address class values defined in the *DWARF Extensions For
Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses` are
used.

In addition, ``DW_ADDR_AMDGPU_region`` is encoded as a vendor extension. This
is available for use for the AMD extension for access to the hardware GDS
memory which is scratchpad memory allocated per device.

For AMDGPU if no ``DW_AT_address_class`` attribute is present, then the
default address class of ``DW_ADDR_none`` is used.

See :ref:`amdgpu-dwarf-address-space-identifier` for information on the AMDGPU
mapping of DWARF address classes to DWARF address spaces, including address
size and NULL value.

.. _amdgpu-dwarf-address-space-identifier:

Address Space Identifier
------------------------

DWARF address spaces correspond to target architecture specific linear
addressable memory areas. See DWARF Version 5 section 2.12 and *DWARF
Extensions For Heterogeneous Debugging* section
:ref:`amdgpu-dwarf-segment_addresses`.

The DWARF address space mapping used for AMDGPU is defined in
:ref:`amdgpu-dwarf-address-space-mapping-table`.

.. table:: AMDGPU DWARF Address Space Mapping
   :name: amdgpu-dwarf-address-space-mapping-table

   ======================================= ===== ======= ======== ================= =======================
   DWARF                                                          AMDGPU            Notes
   --------------------------------------- ----- ---------------- ----------------- -----------------------
   Address Space Name                      Value Address Bit Size Address Space
   --------------------------------------- ----- ------- -------- ----------------- -----------------------
   ..                                            64-bit  32-bit
                                                 process process
                                                 address address
                                                 space   space
   ======================================= ===== ======= ======== ================= =======================
   ``DW_ASPACE_none``                      0x00  64      32       Global            *default address space*
   ``DW_ASPACE_AMDGPU_generic``            0x01  64      32       Generic (Flat)
   ``DW_ASPACE_AMDGPU_region``             0x02  32      32       Region (GDS)
   ``DW_ASPACE_AMDGPU_local``              0x03  32      32       Local (group/LDS)
   *Reserved*                              0x04
   ``DW_ASPACE_AMDGPU_private_lane``       0x05  32      32       Private (Scratch) *focused lane*
   ``DW_ASPACE_AMDGPU_private_wave``       0x06  32      32       Private (Scratch) *unswizzled wavefront*
   ======================================= ===== ======= ======== ================= =======================

See :ref:`amdgpu-address-spaces` for information on the AMDGPU address spaces
including address size and NULL value.

The ``DW_ASPACE_none`` address space is the default target architecture
address space used in DWARF operations that do not specify an address space.
It therefore has to map to the global address space so that the
``DW_OP_addr*`` and related operations can refer to addresses in the program
code.

The ``DW_ASPACE_AMDGPU_generic`` address space allows location expressions to
specify the flat address space. If the address corresponds to an address in
the local address space, then it corresponds to the wavefront that is
executing the focused thread of execution.
If the address corresponds to an address in the private address space, then it
corresponds to the lane that is executing the focused thread of execution for
languages that are implemented using a SIMD or SIMT execution model.

.. note::

  CUDA-like languages such as HIP that do not have address spaces in the
  language type system, but do allow variables to be allocated in different
  address spaces, need to explicitly specify the ``DW_ASPACE_AMDGPU_generic``
  address space in the DWARF expression operations as the default address
  space is the global address space.

The ``DW_ASPACE_AMDGPU_local`` address space allows location expressions to
specify the local address space corresponding to the wavefront that is
executing the focused thread of execution.

The ``DW_ASPACE_AMDGPU_private_lane`` address space allows location
expressions to specify the private address space corresponding to the lane
that is executing the focused thread of execution for languages that are
implemented using a SIMD or SIMT execution model.

The ``DW_ASPACE_AMDGPU_private_wave`` address space allows location
expressions to specify the unswizzled private address space corresponding to
the wavefront that is executing the focused thread of execution. The wavefront
view of private memory is the per wavefront unswizzled backing memory layout
defined in :ref:`amdgpu-address-spaces`, such that address 0 corresponds to
the first location for the backing memory of the wavefront (namely the address
is not offset by ``wavefront-scratch-base``).
The following formula can be used to convert from a
``DW_ASPACE_AMDGPU_private_lane`` address to a
``DW_ASPACE_AMDGPU_private_wave`` address:

::

  private-address-wavefront =
    ((private-address-lane / 4) * wavefront-size * 4) +
    (wavefront-lane-id * 4) + (private-address-lane % 4)

If the ``DW_ASPACE_AMDGPU_private_lane`` address is dword aligned, and the
start of the dwords for each lane starting with lane 0 is required, then this
simplifies to:

::

  private-address-wavefront =
    private-address-lane * wavefront-size

A compiler can use the ``DW_ASPACE_AMDGPU_private_wave`` address space to read
a complete spilled vector register back into a complete vector register in the
CFI. The frame pointer can be a private lane address which is dword aligned,
which can be shifted to multiply by the wavefront size, and then used to form
a private wavefront address that gives a location for a contiguous set of
dwords, one per lane, where the vector register dwords are spilled. The
compiler knows the wavefront size since it generates the code. Note that the
type of the address may have to be converted as the size of a
``DW_ASPACE_AMDGPU_private_lane`` address may be smaller than the size of a
``DW_ASPACE_AMDGPU_private_wave`` address.

.. _amdgpu-dwarf-lane-identifier:

Lane identifier
---------------

DWARF lane identifiers specify a target architecture lane position for
hardware that executes in a SIMD or SIMT manner, and onto which a source
language maps its threads of execution. The DWARF lane identifier is pushed by
the ``DW_OP_LLVM_push_lane`` DWARF expression operation. See DWARF Version 5
section 2.5 which is updated by *DWARF Extensions For Heterogeneous Debugging*
section :ref:`amdgpu-dwarf-operation-expressions`.

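
The private-lane to private-wave address conversion given above can be
cross-checked with a small sketch (hypothetical helper names, not part of the
toolchain; the lane ID used here is the lane identifier this section defines):

```python
# Sketch: convert a swizzled per-lane private (scratch) address to the
# unswizzled per-wavefront view, following the formula in the text.
def private_lane_to_private_wave(addr_lane: int, lane_id: int,
                                 wavefront_size: int) -> int:
    return (((addr_lane // 4) * wavefront_size * 4)
            + (lane_id * 4) + (addr_lane % 4))

# Dword-aligned address, lane 0: reduces to addr_lane * wavefront_size.
print(private_lane_to_private_wave(8, 0, 64))  # 512
```
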

For AMDGPU, the lane identifier corresponds to the hardware lane ID of a
wavefront. It is numbered from 0 to the wavefront size minus 1.

Operation Expressions
---------------------

DWARF expressions are used to compute program values and the locations of
program objects. See DWARF Version 5 section 2.5 and
:ref:`amdgpu-dwarf-operation-expressions`.

DWARF location descriptions describe how to access storage which includes
memory and registers. When accessing storage on AMDGPU, bytes are ordered with
least significant bytes first, and bits are ordered within bytes with least
significant bits first.

For AMDGPU CFI expressions, ``DW_OP_LLVM_select_bit_piece`` is used to
describe unwinding vector registers that are spilled under the execution mask
to memory: the zero-single location description is the vector register, and
the one-single location description is the spilled memory location
description. The ``DW_OP_LLVM_form_aspace_address`` operation is used to
specify the address space of the memory location description.

In AMDGPU expressions, ``DW_OP_LLVM_select_bit_piece`` is used by the
``DW_AT_LLVM_lane_pc`` attribute expression where divergent control flow is
controlled by the execution mask. An undefined location description together
with ``DW_OP_LLVM_extend`` is used to indicate the lane was not active on
entry to the subprogram. See :ref:`amdgpu-dwarf-dw-at-llvm-lane-pc` for an
example.

Debugger Information Entry Attributes
-------------------------------------

This section describes how certain debugger information entry attributes are
used by AMDGPU. See the sections in DWARF Version 5 section 3.3.5 and 3.1.1
which are updated by *DWARF Extensions For Heterogeneous Debugging* sections
:ref:`amdgpu-dwarf-low-level-information` and
:ref:`amdgpu-dwarf-full-and-partial-compilation-unit-entries`.

.. _amdgpu-dwarf-dw-at-llvm-lane-pc:

``DW_AT_LLVM_lane_pc``
~~~~~~~~~~~~~~~~~~~~~~

For AMDGPU, the ``DW_AT_LLVM_lane_pc`` attribute is used to specify the
program location of the separate lanes of a SIMT thread.

If the lane is an active lane then this will be the same as the current
program location.

If the lane is inactive, but was active on entry to the subprogram, then this
is the program location in the subprogram at which execution of the lane is
conceptually positioned.

If the lane was not active on entry to the subprogram, then this will be the
undefined location. A client debugger can check if the lane is part of a valid
work-group by checking that the lane is in the range of the associated
work-group within the grid, accounting for partial work-groups. If it is not,
then the debugger can omit any information for the lane. Otherwise, the
debugger may repeatedly unwind the stack and inspect the
``DW_AT_LLVM_lane_pc`` of the calling subprogram until it finds a
non-undefined location. Conceptually the lane only has the call frames for
which it has a non-undefined ``DW_AT_LLVM_lane_pc``.

The following example illustrates how the AMDGPU backend can generate a DWARF
location list expression for the nested ``IF/THEN/ELSE`` structures of the
following subprogram pseudo code for a target with 64 lanes per wavefront.

.. code::
  :number-lines:

  SUBPROGRAM X
  BEGIN
    a;
    IF (c1) THEN
      b;
      IF (c2) THEN
        c;
      ELSE
        d;
      ENDIF
      e;
    ELSE
      f;
    ENDIF
    g;
  END

The AMDGPU backend may generate the following pseudo LLVM MIR to manipulate
the execution mask (``EXEC``) to linearize the control flow. The condition is
evaluated to make a mask of the lanes for which the condition evaluates to
true.
First the ``THEN`` region is executed by setting the ``EXEC`` mask to the
logical ``AND`` of the current ``EXEC`` mask with the condition mask. Then the
``ELSE`` region is executed by negating the ``EXEC`` mask and performing a
logical ``AND`` with the ``EXEC`` mask saved at the start of the region. After
the ``IF/THEN/ELSE`` region the ``EXEC`` mask is restored to the value it had
at the beginning of the region. This is shown below. Other approaches are
possible, but the basic concept is the same.

.. code::
  :number-lines:

  $lex_start:
    a;
    %1 = EXEC
    %2 = c1
  $lex_1_start:
    EXEC = %1 & %2
  $lex_1_then:
      b;
      %3 = EXEC
      %4 = c2
    $lex_1_1_start:
      EXEC = %3 & %4
    $lex_1_1_then:
        c;
      EXEC = ~EXEC & %3
    $lex_1_1_else:
        d;
      EXEC = %3
    $lex_1_1_end:
      e;
    EXEC = ~EXEC & %1
  $lex_1_else:
      f;
    EXEC = %1
  $lex_1_end:
    g;
  $lex_end:

To create the DWARF location list expression that defines the location
description of a vector of lane program locations, the LLVM MIR ``DBG_VALUE``
pseudo instruction can be used to annotate the linearized control flow. This
can be done by defining an artificial variable for the lane PC. The DWARF
location list expression created for it is used as the value of the
``DW_AT_LLVM_lane_pc`` attribute on the subprogram's debugger information
entry.

A DWARF procedure is defined for each well nested structured control flow
region which provides the conceptual lane program location for a lane if it is
not active (namely it is divergent). The DWARF operation expression for each
region conceptually inherits the value of the immediately enclosing region and
modifies it according to the semantics of the region.

For an ``IF/THEN/ELSE`` region the divergent program location is at the start
of the region for the ``THEN`` region since it is executed first.
For the ``ELSE`` region the divergent program location is at the end of the
``IF/THEN/ELSE`` region since the ``THEN`` region has completed.

The lane PC artificial variable is assigned at each region transition. It uses
the immediately enclosing region's DWARF procedure to compute the program
location for each lane assuming they are divergent, and then modifies the
result by inserting the current program location for each lane that the
``EXEC`` mask indicates is active.

By having separate DWARF procedures for each region, they can be reused to
define the value for any nested region. This reduces the total size of the
DWARF operation expressions.

The following provides an example using pseudo LLVM MIR.

.. code::
  :number-lines:

  $lex_start:
    DEFINE_DWARF %__uint_64 = DW_TAG_base_type[
      DW_AT_name = "__uint64";
      DW_AT_byte_size = 8;
      DW_AT_encoding = DW_ATE_unsigned;
    ];
    DEFINE_DWARF %__active_lane_pc = DW_TAG_dwarf_procedure[
      DW_AT_name = "__active_lane_pc";
      DW_AT_location = [
        DW_OP_regx PC;
        DW_OP_LLVM_extend 64, 64;
        DW_OP_regval_type EXEC, %__uint_64;
        DW_OP_LLVM_select_bit_piece 64, 64;
      ];
    ];
    DEFINE_DWARF %__divergent_lane_pc = DW_TAG_dwarf_procedure[
      DW_AT_name = "__divergent_lane_pc";
      DW_AT_location = [
        DW_OP_LLVM_undefined;
        DW_OP_LLVM_extend 64, 64;
      ];
    ];
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc;
      DW_OP_call_ref %__active_lane_pc;
    ];
    a;
    %1 = EXEC;
    DBG_VALUE %1, $noreg, %__lex_1_save_exec;
    %2 = c1;
  $lex_1_start:
    EXEC = %1 & %2;
  $lex_1_then:
      DEFINE_DWARF %__divergent_lane_pc_1_then = DW_TAG_dwarf_procedure[
        DW_AT_name = "__divergent_lane_pc_1_then";
        DW_AT_location = DIExpression[
          DW_OP_call_ref %__divergent_lane_pc;
          DW_OP_addrx &lex_1_start;
          DW_OP_stack_value;
          DW_OP_LLVM_extend 64, 64;
          DW_OP_call_ref %__lex_1_save_exec;
          DW_OP_deref_type 64, %__uint_64;
          DW_OP_LLVM_select_bit_piece 64, 64;
        ];
      ];
      DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
        DW_OP_call_ref %__divergent_lane_pc_1_then;
        DW_OP_call_ref %__active_lane_pc;
      ];
      b;
      %3 = EXEC;
      DBG_VALUE %3, $noreg, %__lex_1_1_save_exec;
      %4 = c2;
    $lex_1_1_start:
      EXEC = %3 & %4;
    $lex_1_1_then:
        DEFINE_DWARF %__divergent_lane_pc_1_1_then = DW_TAG_dwarf_procedure[
          DW_AT_name = "__divergent_lane_pc_1_1_then";
          DW_AT_location = DIExpression[
            DW_OP_call_ref %__divergent_lane_pc_1_then;
            DW_OP_addrx &lex_1_1_start;
            DW_OP_stack_value;
            DW_OP_LLVM_extend 64, 64;
            DW_OP_call_ref %__lex_1_1_save_exec;
            DW_OP_deref_type 64, %__uint_64;
            DW_OP_LLVM_select_bit_piece 64, 64;
          ];
        ];
        DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
          DW_OP_call_ref %__divergent_lane_pc_1_1_then;
          DW_OP_call_ref %__active_lane_pc;
        ];
        c;
      EXEC = ~EXEC & %3;
    $lex_1_1_else:
        DEFINE_DWARF %__divergent_lane_pc_1_1_else = DW_TAG_dwarf_procedure[
          DW_AT_name = "__divergent_lane_pc_1_1_else";
          DW_AT_location = DIExpression[
            DW_OP_call_ref %__divergent_lane_pc_1_then;
            DW_OP_addrx &lex_1_1_end;
            DW_OP_stack_value;
            DW_OP_LLVM_extend 64, 64;
            DW_OP_call_ref %__lex_1_1_save_exec;
            DW_OP_deref_type 64, %__uint_64;
            DW_OP_LLVM_select_bit_piece 64, 64;
          ];
        ];
        DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
          DW_OP_call_ref %__divergent_lane_pc_1_1_else;
          DW_OP_call_ref %__active_lane_pc;
        ];
        d;
      EXEC = %3;
    $lex_1_1_end:
      DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
        DW_OP_call_ref %__divergent_lane_pc_1_then;
        DW_OP_call_ref %__active_lane_pc;
      ];
      e;
    EXEC = ~EXEC & %1;
  $lex_1_else:
      DEFINE_DWARF %__divergent_lane_pc_1_else = DW_TAG_dwarf_procedure[
        DW_AT_name = "__divergent_lane_pc_1_else";
        DW_AT_location = DIExpression[
          DW_OP_call_ref %__divergent_lane_pc;
          DW_OP_addrx &lex_1_end;
          DW_OP_stack_value;
          DW_OP_LLVM_extend 64, 64;
          DW_OP_call_ref %__lex_1_save_exec;
          DW_OP_deref_type 64, %__uint_64;
          DW_OP_LLVM_select_bit_piece 64, 64;
        ];
      ];
      DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
        DW_OP_call_ref %__divergent_lane_pc_1_else;
        DW_OP_call_ref %__active_lane_pc;
      ];
      f;
    EXEC = %1;
  $lex_1_end:
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc;
      DW_OP_call_ref %__active_lane_pc;
    ];
    g;
  $lex_end:

The DWARF procedure ``%__active_lane_pc`` is used to update the lane pc
elements that are active, with the current program location.

Artificial variables ``%__lex_1_save_exec`` and ``%__lex_1_1_save_exec`` are
created for the execution masks saved on entry to a region. Using the
``DBG_VALUE`` pseudo instruction, location list entries will be created that
describe where the artificial variables are allocated at any given program
location. The compiler may allocate them to registers or spill them to memory.

The DWARF procedures for each region use the values of the saved execution
mask artificial variables to only update the lanes that are active on entry to
the region. All other lanes retain the value of the enclosing region where
they were last active. If they were not active on entry to the subprogram,
then they will have the undefined location description.

Other structured control flow regions can be handled similarly. For example,
loops would set the divergent program location for the region at the end of
the loop. Any lanes active will be in the loop, and any lanes not active must
have exited the loop.

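
The net effect of these expressions can be modeled with a small sketch
(hypothetical, not an LLVM API): for each lane, a set ``EXEC`` bit selects the
current program location, while a clear bit keeps the divergent value computed
by the enclosing region's DWARF procedure, with ``None`` standing in for the
undefined location of lanes that were never active in the subprogram.

```python
# Sketch: per-lane program counters as produced by the lane-pc expressions,
# mirroring DW_OP_LLVM_select_bit_piece over the execution mask.
def select_lane_pcs(exec_mask, divergent_pcs, current_pc):
    return [current_pc if (exec_mask >> lane) & 1 else divergent
            for lane, divergent in enumerate(divergent_pcs)]

# 4-lane toy wavefront: lanes 0 and 2 active; lane 3 was never active.
pcs = select_lane_pcs(0b0101, [0x100, 0x120, 0x100, None], 0x108)
# lanes 0 and 2 report the current PC, lane 1 keeps its divergent PC,
# and lane 3 stays undefined.
```
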

An ``IF/THEN/ELSEIF/ELSEIF/...`` region can be treated as a nest of
``IF/THEN/ELSE`` regions.

The DWARF procedures can use the active lane artificial variable described in
:ref:`amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane` rather than the actual
``EXEC`` mask in order to support whole or quad wavefront mode.

.. _amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane:

``DW_AT_LLVM_active_lane``
~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``DW_AT_LLVM_active_lane`` attribute on a subprogram debugger information
entry is used to specify the lanes that are conceptually active for a SIMT
thread.

The execution mask may be modified to implement whole or quad wavefront mode
operations. For example, all lanes may need to temporarily be made active to
execute a whole wavefront operation. Such regions would save the ``EXEC``
mask, update it to enable the necessary lanes, perform the operations, and
then restore the ``EXEC`` mask from the saved value. While executing the whole
wavefront region, the conceptual execution mask is the saved value, not the
``EXEC`` value.

This is handled by defining an artificial variable for the active lane mask.
The active lane mask artificial variable would be the actual ``EXEC`` mask for
normal regions, and the saved execution mask for regions where the mask is
temporarily updated. The location list expression created for this artificial
variable is used to define the value of the ``DW_AT_LLVM_active_lane``
attribute.

``DW_AT_LLVM_augmentation``
~~~~~~~~~~~~~~~~~~~~~~~~~~~

For AMDGPU, the ``DW_AT_LLVM_augmentation`` attribute of a compilation unit
debugger information entry has the following value for the augmentation
string:

::

  [amdgpu:v0.0]

The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
extensions used in the DWARF of the compilation unit.
The version number 2407conforms to [SEMVER]_. 2408 2409Call Frame Information 2410---------------------- 2411 2412DWARF Call Frame Information (CFI) describes how a consumer can virtually 2413*unwind* call frames in a running process or core dump. See DWARF Version 5 2414section 6.4 and :ref:`amdgpu-dwarf-call-frame-information`. 2415 2416For AMDGPU, the Common Information Entry (CIE) fields have the following values: 2417 24181. ``augmentation`` string contains the following null-terminated UTF-8 string: 2419 2420 :: 2421 2422 [amd:v0.0] 2423 2424 The ``vX.Y`` specifies the major X and minor Y version number of the AMDGPU 2425 extensions used in this CIE or to the FDEs that use it. The version number 2426 conforms to [SEMVER]_. 2427 24282. ``address_size`` for the ``Global`` address space is defined in 2429 :ref:`amdgpu-dwarf-address-space-identifier`. 2430 24313. ``segment_selector_size`` is 0 as AMDGPU does not use a segment selector. 2432 24334. ``code_alignment_factor`` is 4 bytes. 2434 2435 .. TODO:: 2436 2437 Add to :ref:`amdgpu-processor-table` table. 2438 24395. ``data_alignment_factor`` is 4 bytes. 2440 2441 .. TODO:: 2442 2443 Add to :ref:`amdgpu-processor-table` table. 2444 24456. ``return_address_register`` is ``PC_32`` for 32-bit processes and ``PC_64`` 2446 for 64-bit processes defined in :ref:`amdgpu-dwarf-register-identifier`. 2447 24487. ``initial_instructions`` Since a subprogram X with fewer registers can be 2449 called from subprogram Y that has more allocated, X will not change any of 2450 the extra registers as it cannot access them. Therefore, the default rule 2451 for all columns is ``same value``. 2452 2453For AMDGPU the register number follows the numbering defined in 2454:ref:`amdgpu-dwarf-register-identifier`. 2455 2456For AMDGPU the instructions are variable size. A consumer can subtract 1 from 2457the return address to get the address of a byte within the call site 2458instructions. See DWARF Version 5 section 6.4.4. 
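The augmentation strings above all share the ``[vendor:vX.Y]`` layout with a
[SEMVER]_ version number. As a non-normative illustration (these helper
functions are hypothetical, not part of LLVM or any runtime), a consumer might
parse and version-check such a string like this:

```python
import re

def parse_augmentation(s: str):
    """Parse an AMDGPU-style augmentation string such as "[amdgpu:v0.0]"
    (compilation unit) or "[amd:v0.0]" (CIE), returning
    (vendor, major, minor)."""
    m = re.fullmatch(r"\[([a-z]+):v(\d+)\.(\d+)\]", s)
    if m is None:
        raise ValueError(f"not an AMDGPU augmentation string: {s!r}")
    return m.group(1), int(m.group(2)), int(m.group(3))

def is_compatible(consumer_major: int, s: str) -> bool:
    # Under semantic versioning, only a matching major version is
    # guaranteed compatible; minor versions add backward-compatible
    # extensions.
    _, major, _ = parse_augmentation(s)
    return major == consumer_major
```

For example, ``parse_augmentation("[amd:v0.0]")`` yields ``("amd", 0, 0)``.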
Accelerated Access
------------------

See DWARF Version 5 section 6.1.

Lookup By Name Section Header
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

See DWARF Version 5 section 6.1.1.4.1 and :ref:`amdgpu-dwarf-lookup-by-name`.

For AMDGPU, the lookup by name section header table has the following fields:

``augmentation_string_size`` (uword)

  Set to the length of the ``augmentation_string`` value, which is always a
  multiple of 4.

``augmentation_string`` (sequence of UTF-8 characters)

  Contains the following UTF-8 string null padded to a multiple of 4 bytes:

  ::

    [amdgpu:v0.0]

  The ``vX.Y`` specifies the major X and minor Y version number of the AMDGPU
  extensions used in the DWARF of this index. The version number conforms to
  [SEMVER]_.

  .. note::

     This is different from the DWARF Version 5 definition that requires the
     first 4 characters to be the vendor ID. But this is consistent with the
     other augmentation strings and does allow multiple vendor contributions.
     However, backwards compatibility may be more desirable.

Lookup By Address Section Header
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

See DWARF Version 5 section 6.1.2.

For AMDGPU, the lookup by address section header table has the following
fields:

``address_size`` (ubyte)

  Matches the address size for the ``Global`` address space defined in
  :ref:`amdgpu-dwarf-address-space-identifier`.

``segment_selector_size`` (ubyte)

  AMDGPU does not use a segment selector so this is 0. The entries in the
  ``.debug_aranges`` do not have a segment selector.

Line Number Information
-----------------------

See DWARF Version 5 section 6.2 and
:ref:`amdgpu-dwarf-line-number-information`.

AMDGPU does not use the ``isa`` state machine register and always sets it
to 0. The instruction set must be obtained from the ELF file header
``e_flags`` field in the ``EF_AMDGPU_MACH`` bit position (see :ref:`ELF
Header <amdgpu-elf-header>`). See DWARF Version 5 section 6.2.2.

.. TODO::

   Should the ``isa`` state machine register be used to indicate if the code
   is in wavefront32 or wavefront64 mode? Or used to specify the architecture
   ISA?

For AMDGPU the line number program header fields have the following values
(see DWARF Version 5 section 6.2.4):

``address_size`` (ubyte)
  Matches the address size for the ``Global`` address space defined in
  :ref:`amdgpu-dwarf-address-space-identifier`.

``segment_selector_size`` (ubyte)
  AMDGPU does not use a segment selector so this is 0.

``minimum_instruction_length`` (ubyte)
  For GFX9-GFX11 this is 4.

``maximum_operations_per_instruction`` (ubyte)
  For GFX9-GFX11 this is 1.

Source text for online-compiled programs (for example, those compiled by the
OpenCL language runtime) may be embedded into the DWARF Version 5 line table.
See DWARF Version 5 section 6.2.4.1 which is updated by *DWARF Extensions For
Heterogeneous Debugging* section :ref:`DW_LNCT_LLVM_source
<amdgpu-dwarf-line-number-information-dw-lnct-llvm-source>`.

The Clang option used to control source embedding in AMDGPU is defined in
:ref:`amdgpu-clang-debug-options-table`.

  .. table:: AMDGPU Clang Debug Options
     :name: amdgpu-clang-debug-options-table

     ==================== ==================================================
     Debug Flag           Description
     ==================== ==================================================
     -g[no-]embed-source  Enable/disable embedding source text in DWARF
                          debug sections. Useful for environments where
                          source cannot be written to disk, such as when
                          performing online compilation.
     ==================== ==================================================

For example:

``-gembed-source``
  Enable the embedded source.

``-gno-embed-source``
  Disable the embedded source.

32-Bit and 64-Bit DWARF Formats
-------------------------------

See DWARF Version 5 section 7.4 and
:ref:`amdgpu-dwarf-32-bit-and-64-bit-dwarf-formats`.

For AMDGPU:

* For the ``amdgcn`` target architecture only the 64-bit process address
  space is supported.

* The producer can generate either the 32-bit or the 64-bit DWARF format.
  LLVM generates the 32-bit DWARF format.

Unit Headers
------------

For AMDGPU the following values apply for each of the unit headers described
in DWARF Version 5 sections 7.5.1.1, 7.5.1.2, and 7.5.1.3:

``address_size`` (ubyte)
  Matches the address size for the ``Global`` address space defined in
  :ref:`amdgpu-dwarf-address-space-identifier`.

.. _amdgpu-code-conventions:

Code Conventions
================

This section provides the code conventions used for each supported target
triple OS (see :ref:`amdgpu-target-triples`).

AMDHSA
------

This section provides the code conventions used when the target triple OS is
``amdhsa`` (see :ref:`amdgpu-target-triples`).

.. _amdgpu-amdhsa-code-object-metadata:

Code Object Metadata
~~~~~~~~~~~~~~~~~~~~

The code object metadata specifies extensible metadata associated with the
code objects executed on HSA [HSA]_ compatible runtimes (see
:ref:`amdgpu-os`). The encoding and semantics of this metadata depend on the
code object version; see :ref:`amdgpu-amdhsa-code-object-metadata-v2`,
:ref:`amdgpu-amdhsa-code-object-metadata-v3`,
:ref:`amdgpu-amdhsa-code-object-metadata-v4` and
:ref:`amdgpu-amdhsa-code-object-metadata-v5`.
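The printf descriptor strings recorded in the code object metadata (the
``"Printf"`` and ``"amdhsa.printf"`` keys in the metadata tables that follow)
all use the colon-separated ``ID:N:S[0]:...:S[N-1]:FormatString`` layout. As a
non-normative sketch (the function name is hypothetical, not an LLVM or
runtime API), a consumer could decode one as follows; note that only the
leading numeric fields may be split on ``':'``, since the format string itself
may contain colons:

```python
def decode_printf_descriptor(desc: str):
    """Decode an ID:N:S[0]:...:S[N-1]:FormatString printf descriptor."""
    # Split off ID and N only; everything after the second ':' may still
    # contain colons belonging to the sizes or the format string.
    printf_id, n, rest = desc.split(":", 2)
    n = int(n)
    sizes = []
    for _ in range(n):
        # Peel one size field at a time so the format string is never split.
        size, rest = rest.split(":", 1)
        sizes.append(int(size))
    return {
        "id": int(printf_id),    # unique id of the printf call
        "num_args": n,           # number of printf arguments minus 1
        "arg_sizes": sizes,      # byte size of each format-string argument
        "format": rest,          # the format string, colons and all
    }
```

For example, ``decode_printf_descriptor("1:2:4:8:fmt %d:%d")`` yields sizes
``[4, 8]`` and the format string ``"fmt %d:%d"``.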
Code object metadata is specified in a note record (see
:ref:`amdgpu-note-records`) and is required when the target triple OS is
``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
information necessary to support the HSA compatible runtime kernel queries.
For example, the segment sizes needed in a dispatch packet. In addition, a
high-level language runtime may require other information to be included. For
example, the AMD OpenCL runtime records kernel argument information.

.. _amdgpu-amdhsa-code-object-metadata-v2:

Code Object V2 Metadata
+++++++++++++++++++++++

.. warning::

  Code object V2 is not the default code object version emitted by this
  version of LLVM.

Code object V2 metadata is specified by the ``NT_AMD_HSA_METADATA`` note
record (see :ref:`amdgpu-note-records-v2`).

The metadata is specified as a YAML formatted string (see [YAML]_ and
:doc:`YamlIO`).

.. TODO::

   Is the string null terminated? It probably should not be if YAML allows it
   to contain null characters, otherwise it should be.

The metadata is represented as a single YAML document comprised of the mapping
defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-v2-table` and
referenced tables.

For boolean values, the string values of ``false`` and ``true`` are used for
false and true respectively.

Additional information can be added to the mappings. To avoid conflicts, any
non-AMD key names should be prefixed by "*vendor-name*.".

  .. table:: AMDHSA Code Object V2 Metadata Map
     :name: amdgpu-amdhsa-code-object-metadata-map-v2-table

     ========== ============== ========= =======================================
     String Key Value Type     Required? Description
     ========== ============== ========= =======================================
     "Version"  sequence of    Required  - The first integer is the major
                2 integers                 version. Currently 1.
                                         - The second integer is the minor
                                           version. Currently 0.
     "Printf"   sequence of              Each string is encoded information
                strings                  about a printf function call. The
                                         encoded information is organized as
                                         fields separated by colon (':'):

                                         ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``

                                         where:

                                         ``ID``
                                           A 32-bit integer as a unique id for
                                           each printf function call

                                         ``N``
                                           A 32-bit integer equal to the number
                                           of arguments of printf function call
                                           minus 1

                                         ``S[i]`` (where i = 0, 1, ... , N-1)
                                           32-bit integers for the size in bytes
                                           of the i-th FormatString argument of
                                           the printf function call

                                         ``FormatString``
                                           The format string passed to the
                                           printf function call.
     "Kernels"  sequence of    Required  Sequence of the mappings for each
                mapping                  kernel in the code object. See
                                         :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table`
                                         for the definition of the mapping.
     ========== ============== ========= =======================================

..

  .. table:: AMDHSA Code Object V2 Kernel Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table

     ================= ============== ========= ================================
     String Key        Value Type     Required? Description
     ================= ============== ========= ================================
     "Name"            string         Required  Source name of the kernel.
     "SymbolName"      string         Required  Name of the kernel
                                                descriptor ELF symbol.
     "Language"        string                   Source language of the kernel.
                                                Values include:

                                                - "OpenCL C"
                                                - "OpenCL C++"
                                                - "HCC"
                                                - "OpenMP"

     "LanguageVersion" sequence of              - The first integer is the major
                       2 integers                 version.
                                                - The second integer is the
                                                  minor version.
     "Attrs"           mapping                  Mapping of kernel attributes.
                                                See
                                                :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table`
                                                for the mapping definition.
     "Args"            sequence of              Sequence of mappings of the
                       mapping                  kernel arguments. See
                                                :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table`
                                                for the definition of the
                                                mapping.
     "CodeProps"       mapping                  Mapping of properties related to
                                                the kernel code. See
                                                :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table`
                                                for the mapping definition.
     ================= ============== ========= ================================

..

  .. table:: AMDHSA Code Object V2 Kernel Attribute Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table

     =================== ============== ========= ==============================
     String Key          Value Type     Required? Description
     =================== ============== ========= ==============================
     "ReqdWorkGroupSize" sequence of              If not 0, 0, 0 then all values
                         3 integers               must be >=1 and the dispatch
                                                  work-group size X, Y, Z must
                                                  correspond to the specified
                                                  values. Defaults to 0, 0, 0.

                                                  Corresponds to the OpenCL
                                                  ``reqd_work_group_size``
                                                  attribute.
     "WorkGroupSizeHint" sequence of              The dispatch work-group size
                         3 integers               X, Y, Z is likely to be the
                                                  specified values.

                                                  Corresponds to the OpenCL
                                                  ``work_group_size_hint``
                                                  attribute.
     "VecTypeHint"       string                   The name of a scalar or vector
                                                  type.

                                                  Corresponds to the OpenCL
                                                  ``vec_type_hint`` attribute.
     "RuntimeHandle"     string                   The external symbol name
                                                  associated with a kernel.
                                                  OpenCL runtime allocates a
                                                  global buffer for the symbol
                                                  and saves the kernel's address
                                                  to it, which is used for
                                                  device side enqueueing. Only
                                                  available for device side
                                                  enqueued kernels.
     =================== ============== ========= ==============================

..

  .. table:: AMDHSA Code Object V2 Kernel Argument Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table

     ================= ============== ========= ================================
     String Key        Value Type     Required? Description
     ================= ============== ========= ================================
     "Name"            string                   Kernel argument name.
     "TypeName"        string                   Kernel argument type name.
     "Size"            integer        Required  Kernel argument size in bytes.
     "Align"           integer        Required  Kernel argument alignment in
                                                bytes. Must be a power of 2.
     "ValueKind"       string         Required  Kernel argument kind that
                                                specifies how to set up the
                                                corresponding argument.
                                                Values include:

                                                "ByValue"
                                                  The argument is copied
                                                  directly into the kernarg.

                                                "GlobalBuffer"
                                                  A global address space pointer
                                                  to the buffer data is passed
                                                  in the kernarg.

                                                "DynamicSharedPointer"
                                                  A group address space pointer
                                                  to dynamically allocated LDS
                                                  is passed in the kernarg.

                                                "Sampler"
                                                  A global address space
                                                  pointer to a S# is passed in
                                                  the kernarg.

                                                "Image"
                                                  A global address space
                                                  pointer to a T# is passed in
                                                  the kernarg.

                                                "Pipe"
                                                  A global address space pointer
                                                  to an OpenCL pipe is passed in
                                                  the kernarg.

                                                "Queue"
                                                  A global address space pointer
                                                  to an OpenCL device enqueue
                                                  queue is passed in the
                                                  kernarg.

                                                "HiddenGlobalOffsetX"
                                                  The OpenCL grid dispatch
                                                  global offset for the X
                                                  dimension is passed in the
                                                  kernarg.

                                                "HiddenGlobalOffsetY"
                                                  The OpenCL grid dispatch
                                                  global offset for the Y
                                                  dimension is passed in the
                                                  kernarg.

                                                "HiddenGlobalOffsetZ"
                                                  The OpenCL grid dispatch
                                                  global offset for the Z
                                                  dimension is passed in the
                                                  kernarg.

                                                "HiddenNone"
                                                  An argument that is not used
                                                  by the kernel. Space needs to
                                                  be left for it, but it does
                                                  not need to be set up.

                                                "HiddenPrintfBuffer"
                                                  A global address space pointer
                                                  to the runtime printf buffer
                                                  is passed in kernarg. Mutually
                                                  exclusive with
                                                  "HiddenHostcallBuffer".

                                                "HiddenHostcallBuffer"
                                                  A global address space pointer
                                                  to the runtime hostcall buffer
                                                  is passed in kernarg. Mutually
                                                  exclusive with
                                                  "HiddenPrintfBuffer".

                                                "HiddenDefaultQueue"
                                                  A global address space pointer
                                                  to the OpenCL device enqueue
                                                  queue that should be used by
                                                  the kernel by default is
                                                  passed in the kernarg.

                                                "HiddenCompletionAction"
                                                  A global address space pointer
                                                  to help link enqueued kernels
                                                  into the ancestor tree for
                                                  determining when the parent
                                                  kernel has finished.

                                                "HiddenMultiGridSyncArg"
                                                  A global address space pointer
                                                  for multi-grid synchronization
                                                  is passed in the kernarg.

     "ValueType"       string                   Unused and deprecated. This
                                                should no longer be emitted,
                                                but is accepted for
                                                compatibility.
     "PointeeAlign"    integer                  Alignment in bytes of pointee
                                                type for pointer type kernel
                                                argument. Must be a power
                                                of 2. Only present if
                                                "ValueKind" is
                                                "DynamicSharedPointer".
     "AddrSpaceQual"   string                   Kernel argument address space
                                                qualifier. Only present if
                                                "ValueKind" is "GlobalBuffer" or
                                                "DynamicSharedPointer". Values
                                                are:

                                                - "Private"
                                                - "Global"
                                                - "Constant"
                                                - "Local"
                                                - "Generic"
                                                - "Region"

                                                .. TODO::

                                                   Is GlobalBuffer only Global
                                                   or Constant? Is
                                                   DynamicSharedPointer always
                                                   Local? Can HCC allow Generic?
                                                   How can Private or Region
                                                   ever happen?

     "AccQual"         string                   Kernel argument access
                                                qualifier. Only present if
                                                "ValueKind" is "Image" or
                                                "Pipe". Values are:

                                                - "ReadOnly"
                                                - "WriteOnly"
                                                - "ReadWrite"

                                                .. TODO::

                                                   Does this apply to
                                                   GlobalBuffer?

     "ActualAccQual"   string                   The actual memory accesses
                                                performed by the kernel on the
                                                kernel argument. Only present if
                                                "ValueKind" is "GlobalBuffer",
                                                "Image", or "Pipe". This may be
                                                more restrictive than indicated
                                                by "AccQual" to reflect what the
                                                kernel actually does. If not
                                                present then the runtime must
                                                assume what is implied by
                                                "AccQual" and "IsConst". Values
                                                are:

                                                - "ReadOnly"
                                                - "WriteOnly"
                                                - "ReadWrite"

     "IsConst"         boolean                  Indicates if the kernel argument
                                                is const qualified. Only present
                                                if "ValueKind" is
                                                "GlobalBuffer".
     "IsRestrict"      boolean                  Indicates if the kernel argument
                                                is restrict qualified. Only
                                                present if "ValueKind" is
                                                "GlobalBuffer".
     "IsVolatile"      boolean                  Indicates if the kernel argument
                                                is volatile qualified. Only
                                                present if "ValueKind" is
                                                "GlobalBuffer".
     "IsPipe"          boolean                  Indicates if the kernel argument
                                                is pipe qualified. Only present
                                                if "ValueKind" is "Pipe".

                                                .. TODO::

                                                   Can GlobalBuffer be pipe
                                                   qualified?

     ================= ============== ========= ================================

..

  .. table:: AMDHSA Code Object V2 Kernel Code Properties Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table

     ============================ ============== ========= =====================
     String Key                   Value Type     Required? Description
     ============================ ============== ========= =====================
     "KernargSegmentSize"         integer        Required  The size in bytes of
                                                           the kernarg segment
                                                           that holds the values
                                                           of the arguments to
                                                           the kernel.
     "GroupSegmentFixedSize"      integer        Required  The amount of group
                                                           segment memory
                                                           required by a
                                                           work-group in
                                                           bytes. This does not
                                                           include any
                                                           dynamically allocated
                                                           group segment memory
                                                           that may be added
                                                           when the kernel is
                                                           dispatched.
     "PrivateSegmentFixedSize"    integer        Required  The amount of fixed
                                                           private address space
                                                           memory required for a
                                                           work-item in
                                                           bytes. If the kernel
                                                           uses a dynamic call
                                                           stack then additional
                                                           space must be added
                                                           to this value for the
                                                           call stack.
     "KernargSegmentAlign"        integer        Required  The maximum byte
                                                           alignment of
                                                           arguments in the
                                                           kernarg segment. Must
                                                           be a power of 2.
     "WavefrontSize"              integer        Required  Wavefront size. Must
                                                           be a power of 2.
     "NumSGPRs"                   integer        Required  Number of scalar
                                                           registers used by a
                                                           wavefront for
                                                           GFX6-GFX11. This
                                                           includes the special
                                                           SGPRs for VCC, Flat
                                                           Scratch (GFX7-GFX10)
                                                           and XNACK (for
                                                           GFX8-GFX10). It does
                                                           not include the 16
                                                           SGPR added if a trap
                                                           handler is
                                                           enabled. It is not
                                                           rounded up to the
                                                           allocation
                                                           granularity.
     "NumVGPRs"                   integer        Required  Number of vector
                                                           registers used by
                                                           each work-item for
                                                           GFX6-GFX11.
     "MaxFlatWorkGroupSize"       integer        Required  Maximum flat
                                                           work-group size
                                                           supported by the
                                                           kernel in work-items.
                                                           Must be >=1 and
                                                           consistent with
                                                           ReqdWorkGroupSize if
                                                           not 0, 0, 0.
     "NumSpilledSGPRs"            integer                  Number of stores from
                                                           a scalar register to
                                                           a register allocator
                                                           created spill
                                                           location.
     "NumSpilledVGPRs"            integer                  Number of stores from
                                                           a vector register to
                                                           a register allocator
                                                           created spill
                                                           location.
     ============================ ============== ========= =====================

.. _amdgpu-amdhsa-code-object-metadata-v3:

Code Object V3 Metadata
+++++++++++++++++++++++

.. warning::

  Code object V3 is not the default code object version emitted by this
  version of LLVM.

Code object V3 and above metadata is specified by the ``NT_AMDGPU_METADATA``
note record (see :ref:`amdgpu-note-records-v3-onwards`).

The metadata is represented as Message Pack formatted binary data (see
[MsgPack]_). The top level is a Message Pack map that includes the keys
defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v3` and
referenced tables.

Additional information can be added to the maps. To avoid conflicts, any key
names should be prefixed by "*vendor-name*." where ``vendor-name`` can be the
name of the vendor and specific vendor tool that generates the information.
The prefix is abbreviated to simply "." when it appears within a map that has
been added by the same *vendor-name*.

  .. table:: AMDHSA Code Object V3 Metadata Map
     :name: amdgpu-amdhsa-code-object-metadata-map-table-v3

     ================= ============== ========= =======================================
     String Key        Value Type     Required? Description
     ================= ============== ========= =======================================
     "amdhsa.version"  sequence of    Required  - The first integer is the major
                       2 integers                 version. Currently 1.
                                                - The second integer is the minor
                                                  version. Currently 0.
     "amdhsa.printf"   sequence of              Each string is encoded information
                       strings                  about a printf function call. The
                                                encoded information is organized as
                                                fields separated by colon (':'):

                                                ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``

                                                where:

                                                ``ID``
                                                  A 32-bit integer as a unique id for
                                                  each printf function call

                                                ``N``
                                                  A 32-bit integer equal to the number
                                                  of arguments of printf function call
                                                  minus 1

                                                ``S[i]`` (where i = 0, 1, ... , N-1)
                                                  32-bit integers for the size in bytes
                                                  of the i-th FormatString argument of
                                                  the printf function call

                                                ``FormatString``
                                                  The format string passed to the
                                                  printf function call.
     "amdhsa.kernels"  sequence of    Required  Sequence of the maps for each
                       map                      kernel in the code object. See
                                                :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3`
                                                for the definition of the keys included
                                                in that map.
     ================= ============== ========= =======================================

..

  .. table:: AMDHSA Code Object V3 Kernel Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3

     =================================== ============== ========= ================================
     String Key                          Value Type     Required? Description
     =================================== ============== ========= ================================
     ".name"                             string         Required  Source name of the kernel.
     ".symbol"                           string         Required  Name of the kernel
                                                                  descriptor ELF symbol.
     ".language"                         string                   Source language of the kernel.
                                                                  Values include:

                                                                  - "OpenCL C"
                                                                  - "OpenCL C++"
                                                                  - "HCC"
                                                                  - "HIP"
                                                                  - "OpenMP"
                                                                  - "Assembler"

     ".language_version"                 sequence of              - The first integer is the major
                                         2 integers                 version.
                                                                  - The second integer is the
                                                                    minor version.
     ".args"                             sequence of              Sequence of maps of the
                                         map                      kernel arguments. See
                                                                  :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`
                                                                  for the definition of the keys
                                                                  included in that map.
     ".reqd_workgroup_size"              sequence of              If not 0, 0, 0 then all values
                                         3 integers               must be >=1 and the dispatch
                                                                  work-group size X, Y, Z must
                                                                  correspond to the specified
                                                                  values. Defaults to 0, 0, 0.

                                                                  Corresponds to the OpenCL
                                                                  ``reqd_work_group_size``
                                                                  attribute.
     ".workgroup_size_hint"              sequence of              The dispatch work-group size
                                         3 integers               X, Y, Z is likely to be the
                                                                  specified values.

                                                                  Corresponds to the OpenCL
                                                                  ``work_group_size_hint``
                                                                  attribute.
     ".vec_type_hint"                    string                   The name of a scalar or vector
                                                                  type.

                                                                  Corresponds to the OpenCL
                                                                  ``vec_type_hint`` attribute.
     ".device_enqueue_symbol"            string                   The external symbol name
                                                                  associated with a kernel.
                                                                  OpenCL runtime allocates a
                                                                  global buffer for the symbol
                                                                  and saves the kernel's address
                                                                  to it, which is used for
                                                                  device side enqueueing. Only
                                                                  available for device side
                                                                  enqueued kernels.
     ".kernarg_segment_size"             integer        Required  The size in bytes of
                                                                  the kernarg segment
                                                                  that holds the values
                                                                  of the arguments to
                                                                  the kernel.
     ".group_segment_fixed_size"         integer        Required  The amount of group
                                                                  segment memory
                                                                  required by a
                                                                  work-group in
                                                                  bytes. This does not
                                                                  include any
                                                                  dynamically allocated
                                                                  group segment memory
                                                                  that may be added
                                                                  when the kernel is
                                                                  dispatched.
     ".private_segment_fixed_size"       integer        Required  The amount of fixed
                                                                  private address space
                                                                  memory required for a
                                                                  work-item in
                                                                  bytes. If the kernel
                                                                  uses a dynamic call
                                                                  stack then additional
                                                                  space must be added
                                                                  to this value for the
                                                                  call stack.
     ".kernarg_segment_align"            integer        Required  The maximum byte
                                                                  alignment of
                                                                  arguments in the
                                                                  kernarg segment. Must
                                                                  be a power of 2.
     ".uses_dynamic_stack"               boolean                  Indicates if the generated
                                                                  machine code is using a
                                                                  dynamically sized stack.
     ".wavefront_size"                   integer        Required  Wavefront size. Must
                                                                  be a power of 2.
     ".sgpr_count"                       integer        Required  Number of scalar
                                                                  registers required by a
                                                                  wavefront for
                                                                  GFX6-GFX9. A register
                                                                  is required if it is
                                                                  used explicitly, or
                                                                  if a higher numbered
                                                                  register is used
                                                                  explicitly. This
                                                                  includes the special
                                                                  SGPRs for VCC, Flat
                                                                  Scratch (GFX7-GFX9)
                                                                  and XNACK (for
                                                                  GFX8-GFX9). It does
                                                                  not include the 16
                                                                  SGPR added if a trap
                                                                  handler is
                                                                  enabled. It is not
                                                                  rounded up to the
                                                                  allocation
                                                                  granularity.
     ".vgpr_count"                       integer        Required  Number of vector
                                                                  registers required by
                                                                  each work-item for
                                                                  GFX6-GFX9. A register
                                                                  is required if it is
                                                                  used explicitly, or
                                                                  if a higher numbered
                                                                  register is used
                                                                  explicitly.
     ".agpr_count"                       integer        Required  Number of accumulator
                                                                  registers required by
                                                                  each work-item for
                                                                  GFX90A, GFX908.
     ".max_flat_workgroup_size"          integer        Required  Maximum flat
                                                                  work-group size
                                                                  supported by the
                                                                  kernel in work-items.
                                                                  Must be >=1 and
                                                                  consistent with
                                                                  ReqdWorkGroupSize if
                                                                  not 0, 0, 0.
     ".sgpr_spill_count"                 integer                  Number of stores from
                                                                  a scalar register to
                                                                  a register allocator
                                                                  created spill
                                                                  location.
     ".vgpr_spill_count"                 integer                  Number of stores from
                                                                  a vector register to
                                                                  a register allocator
                                                                  created spill
                                                                  location.
     ".kind"                             string                   The kind of the kernel
                                                                  with the following
                                                                  values:

                                                                  "normal"
                                                                    Regular kernels.

                                                                  "init"
                                                                    These kernels must be
                                                                    invoked after loading
                                                                    the containing code
                                                                    object and must
                                                                    complete before any
                                                                    normal and fini
                                                                    kernels in the same
                                                                    code object are
                                                                    invoked.

                                                                  "fini"
                                                                    These kernels must be
                                                                    invoked before
                                                                    unloading the
                                                                    containing code object
                                                                    and after all init and
                                                                    normal kernels in the
                                                                    same code object have
                                                                    been invoked and
                                                                    completed.

                                                                  If omitted, "normal" is
                                                                  assumed.
     =================================== ============== ========= ================================

..

  .. table:: AMDHSA Code Object V3 Kernel Argument Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3

     ====================== ============== ========= ================================
     String Key             Value Type     Required? Description
     ====================== ============== ========= ================================
     ".name"                string                   Kernel argument name.
     ".type_name"           string                   Kernel argument type name.
     ".size"                integer        Required  Kernel argument size in bytes.
     ".offset"              integer        Required  Kernel argument offset in
                                                     bytes. The offset must be a
                                                     multiple of the alignment
                                                     required by the argument.
     ".value_kind"          string         Required  Kernel argument kind that
                                                     specifies how to set up the
                                                     corresponding argument.
                                                     Values include:

                                                     "by_value"
                                                       The argument is copied
                                                       directly into the kernarg.

                                                     "global_buffer"
                                                       A global address space pointer
                                                       to the buffer data is passed
                                                       in the kernarg.

                                                     "dynamic_shared_pointer"
                                                       A group address space pointer
                                                       to dynamically allocated LDS
                                                       is passed in the kernarg.

                                                     "sampler"
                                                       A global address space
                                                       pointer to a S# is passed in
                                                       the kernarg.

                                                     "image"
                                                       A global address space
                                                       pointer to a T# is passed in
                                                       the kernarg.

                                                     "pipe"
                                                       A global address space pointer
                                                       to an OpenCL pipe is passed in
                                                       the kernarg.

                                                     "queue"
                                                       A global address space pointer
                                                       to an OpenCL device enqueue
                                                       queue is passed in the
                                                       kernarg.

                                                     "hidden_global_offset_x"
                                                       The OpenCL grid dispatch
                                                       global offset for the X
                                                       dimension is passed in the
                                                       kernarg.

                                                     "hidden_global_offset_y"
                                                       The OpenCL grid dispatch
                                                       global offset for the Y
                                                       dimension is passed in the
                                                       kernarg.

                                                     "hidden_global_offset_z"
                                                       The OpenCL grid dispatch
                                                       global offset for the Z
                                                       dimension is passed in the
                                                       kernarg.

                                                     "hidden_none"
                                                       An argument that is not used
                                                       by the kernel. Space needs to
                                                       be left for it, but it does
                                                       not need to be set up.

                                                     "hidden_printf_buffer"
                                                       A global address space pointer
                                                       to the runtime printf buffer
                                                       is passed in kernarg. Mutually
                                                       exclusive with
                                                       "hidden_hostcall_buffer"
                                                       before Code Object V5.

                                                     "hidden_hostcall_buffer"
                                                       A global address space pointer
                                                       to the runtime hostcall buffer
                                                       is passed in kernarg. Mutually
                                                       exclusive with
                                                       "hidden_printf_buffer"
                                                       before Code Object V5.

                                                     "hidden_default_queue"
                                                       A global address space pointer
                                                       to the OpenCL device enqueue
                                                       queue that should be used by
                                                       the kernel by default is
                                                       passed in the kernarg.

                                                     "hidden_completion_action"
                                                       A global address space pointer
                                                       to help link enqueued kernels
                                                       into the ancestor tree for
                                                       determining when the parent
                                                       kernel has finished.

                                                     "hidden_multigrid_sync_arg"
                                                       A global address space pointer
                                                       for multi-grid synchronization
                                                       is passed in the kernarg.

     ".value_type"          string                   Unused and deprecated. This
                                                     should no longer be emitted,
                                                     but is accepted for
                                                     compatibility.
     ".pointee_align"       integer                  Alignment in bytes of pointee
                                                     type for pointer type kernel
                                                     argument. Must be a power
                                                     of 2. Only present if
                                                     ".value_kind" is
                                                     "dynamic_shared_pointer".
     ".address_space"       string                   Kernel argument address space
                                                     qualifier. Only present if
                                                     ".value_kind" is "global_buffer"
                                                     or "dynamic_shared_pointer".
                                                     Values are:

                                                     - "private"
                                                     - "global"
                                                     - "constant"
                                                     - "local"
                                                     - "generic"
                                                     - "region"

                                                     .. TODO::

                                                        Is "global_buffer" only
                                                        "global" or "constant"? Is
                                                        "dynamic_shared_pointer"
                                                        always "local"? Can HCC
                                                        allow "generic"? How can
                                                        "private" or "region" ever
                                                        happen?

     ".access"              string                   Kernel argument access
                                                     qualifier. Only present if
                                                     ".value_kind" is "image" or
                                                     "pipe". Values are:

                                                     - "read_only"
                                                     - "write_only"
                                                     - "read_write"

                                                     .. TODO::

                                                        Does this apply to
                                                        "global_buffer"?

     ".actual_access"       string                   The actual memory accesses
                                                     performed by the kernel on the
                                                     kernel argument. Only present if
                                                     ".value_kind" is
                                                     "global_buffer", "image", or
                                                     "pipe". This may be more
                                                     restrictive than indicated by
                                                     ".access" to reflect what the
                                                     kernel actually does. If not
                                                     present then the runtime must
                                                     assume what is implied by
                                                     ".access" and ".is_const".
                                                     Values are:

                                                     - "read_only"
                                                     - "write_only"
                                                     - "read_write"

     ".is_const"            boolean                  Indicates if the kernel argument
                                                     is const qualified. Only present
                                                     if ".value_kind" is
                                                     "global_buffer".
     ".is_restrict"         boolean                  Indicates if the kernel argument
                                                     is restrict qualified. Only
                                                     present if ".value_kind" is
                                                     "global_buffer".
     ".is_volatile"         boolean                  Indicates if the kernel argument
                                                     is volatile qualified. Only
                                                     present if ".value_kind" is
                                                     "global_buffer".
     ".is_pipe"             boolean                  Indicates if the kernel argument
                                                     is pipe qualified. Only present
                                                     if ".value_kind" is "pipe".

                                                     .. TODO::

                                                        Can "global_buffer" be pipe
                                                        qualified?

     ====================== ============== ========= ================================

.. _amdgpu-amdhsa-code-object-metadata-v4:

Code Object V4 Metadata
+++++++++++++++++++++++

Code object V4 metadata is the same as
:ref:`amdgpu-amdhsa-code-object-metadata-v3` with the changes and additions
defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v4`.

  .. table:: AMDHSA Code Object V4 Metadata Map Changes
     :name: amdgpu-amdhsa-code-object-metadata-map-table-v4

     ================= ============== ========= =======================================
     String Key        Value Type     Required? Description
     ================= ============== ========= =======================================
     "amdhsa.version"  sequence of    Required  - The first integer is the major
                       2 integers                 version. Currently 1.
                                                - The second integer is the minor
                                                  version. Currently 1.
     "amdhsa.target"   string         Required  The target name of the code using the
                                                syntax:

                                                .. code::

                                                  <target-triple> [ "-" <target-id> ]

                                                A canonical target ID must be
                                                used. See :ref:`amdgpu-target-triples`
                                                and :ref:`amdgpu-target-id`.
     ================= ============== ========= =======================================

.. _amdgpu-amdhsa-code-object-metadata-v5:

Code Object V5 Metadata
+++++++++++++++++++++++

.. warning::

  Code object V5 is not the default code object version emitted by this
  version of LLVM.

Code object V5 metadata is the same as
:ref:`amdgpu-amdhsa-code-object-metadata-v4` with the changes defined in table
:ref:`amdgpu-amdhsa-code-object-metadata-map-table-v5` and table
:ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v5`.

  .. table:: AMDHSA Code Object V5 Metadata Map Changes
     :name: amdgpu-amdhsa-code-object-metadata-map-table-v5

     ================= ============== ========= =======================================
     String Key        Value Type     Required?
Description 3564 ================= ============== ========= ======================================= 3565 "amdhsa.version" sequence of Required - The first integer is the major 3566 2 integers version. Currently 1. 3567 - The second integer is the minor 3568 version. Currently 2. 3569 ================= ============== ========= ======================================= 3570 3571.. 3572 3573 .. table:: AMDHSA Code Object V5 Kernel Argument Metadata Map Additions and Changes 3574 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v5 3575 3576 ====================== ============== ========= ================================ 3577 String Key Value Type Required? Description 3578 ====================== ============== ========= ================================ 3579 ".value_kind" string Required Kernel argument kind that 3580 specifies how to set up the 3581 corresponding argument. 3582 Values include: 3583 the same as code object V3 metadata 3584 (see :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`) 3585 with the following additions: 3586 3587 "hidden_block_count_x" 3588 The grid dispatch work-group count for the X dimension 3589 is passed in the kernarg. Some languages, such as OpenCL, 3590 support a last work-group in each dimension being partial. 3591 This count only includes the non-partial work-group count. 3592 This is not the same as the value in the AQL dispatch packet, 3593 which has the grid size in work-items. 3594 3595 "hidden_block_count_y" 3596 The grid dispatch work-group count for the Y dimension 3597 is passed in the kernarg. Some languages, such as OpenCL, 3598 support a last work-group in each dimension being partial. 3599 This count only includes the non-partial work-group count. 3600 This is not the same as the value in the AQL dispatch packet, 3601 which has the grid size in work-items. If the grid dimensionality 3602 is 1, then must be 1. 
                                                     "hidden_block_count_z"
                                                       The grid dispatch work-group count for the Z dimension
                                                       is passed in the kernarg. Some languages, such as OpenCL,
                                                       support a last work-group in each dimension being partial.
                                                       This count only includes the non-partial work-group count.
                                                       This is not the same as the value in the AQL dispatch packet,
                                                       which has the grid size in work-items. If the grid dimensionality
                                                       is 1 or 2, then must be 1.

                                                     "hidden_group_size_x"
                                                       The grid dispatch work-group size for the X dimension is
                                                       passed in the kernarg. This size only applies to the
                                                       non-partial work-groups. This is the same value as the AQL
                                                       dispatch packet work-group size.

                                                     "hidden_group_size_y"
                                                       The grid dispatch work-group size for the Y dimension is
                                                       passed in the kernarg. This size only applies to the
                                                       non-partial work-groups. This is the same value as the AQL
                                                       dispatch packet work-group size. If the grid dimensionality
                                                       is 1, then must be 1.

                                                     "hidden_group_size_z"
                                                       The grid dispatch work-group size for the Z dimension is
                                                       passed in the kernarg. This size only applies to the
                                                       non-partial work-groups. This is the same value as the AQL
                                                       dispatch packet work-group size. If the grid dimensionality
                                                       is 1 or 2, then must be 1.

                                                     "hidden_remainder_x"
                                                       The grid dispatch work group size of the partial work group
                                                       of the X dimension, if it exists. Must be zero if a partial
                                                       work group does not exist in the X dimension.

                                                     "hidden_remainder_y"
                                                       The grid dispatch work group size of the partial work group
                                                       of the Y dimension, if it exists. Must be zero if a partial
                                                       work group does not exist in the Y dimension.

                                                     "hidden_remainder_z"
                                                       The grid dispatch work group size of the partial work group
                                                       of the Z dimension, if it exists. Must be zero if a partial
                                                       work group does not exist in the Z dimension.

                                                     "hidden_grid_dims"
                                                       The grid dispatch dimensionality. This is the same value
                                                       as the AQL dispatch packet dimensionality. Must be a value
                                                       between 1 and 3.

                                                     "hidden_heap_v1"
                                                       A global address space pointer to an initialized memory
                                                       buffer that conforms to the requirements of the malloc/free
                                                       device library V1 version implementation.

                                                     "hidden_private_base"
                                                       The high 32 bits of the flat addressing private aperture base.
                                                       Only used by GFX8 to allow conversion between private segment
                                                       and flat addresses. See :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.

                                                     "hidden_shared_base"
                                                       The high 32 bits of the flat addressing shared aperture base.
                                                       Only used by GFX8 to allow conversion between shared segment
                                                       and flat addresses. See :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.

                                                     "hidden_queue_ptr"
                                                       A global memory address space pointer to the ROCm runtime
                                                       ``struct amd_queue_t`` structure for the HSA queue of the
                                                       associated dispatch AQL packet. It is only required for pre-GFX9
                                                       devices for the trap handler ABI (see :ref:`amdgpu-amdhsa-trap-handler-abi`).

     ====================== ============== ========= ================================

..

Kernel Dispatch
~~~~~~~~~~~~~~~

The HSA architected queuing language (AQL) defines a user space memory interface
that can be used to control the dispatch of kernels, in an agent independent
way. An agent can have zero or more AQL queues created for it using an HSA
compatible runtime (see :ref:`amdgpu-os`), in which AQL packets (all of which
are 64 bytes) can be placed. See the *HSA Platform System Architecture
Specification* [HSA]_ for the AQL queue mechanics and packet layouts.

The packet processor of a kernel agent is responsible for detecting and
dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
packet processor is implemented by the hardware command processor (CP),
asynchronous dispatch controller (ADC) and shader processor input controller
(SPI).

An HSA compatible runtime can be used to allocate an AQL queue object. It uses
the kernel mode driver to initialize and register the AQL queue with CP.

To dispatch a kernel the following actions are performed. This can occur in the
CPU host program, or from an HSA kernel executing on a GPU.

1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
   executed is obtained.
2. A pointer to the kernel descriptor (see
   :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is obtained.
   It must be for a kernel that is contained in a code object that was loaded
   by an HSA compatible runtime on the kernel agent with which the AQL queue is
   associated.
3. Space is allocated for the kernel arguments using the HSA compatible runtime
   allocator for a memory region with the kernarg property for the kernel agent
   that will execute the kernel. It must be at least 16-byte aligned.
4. Kernel argument values are assigned to the kernel argument memory
   allocation. The layout is defined in the *HSA Programmer's Language
   Reference* [HSA]_. For AMDGPU the kernel execution directly accesses the
   kernel argument memory in the same way constant memory is accessed. (Note
   that the HSA specification allows an implementation to copy the kernel
   argument contents to another location that is accessed by the kernel.)
5. An AQL kernel dispatch packet is created on the AQL queue. The HSA compatible
   runtime API uses 64-bit atomic operations to reserve space in the AQL queue
   for the packet. The packet must be set up, and the final write must use an
   atomic store release to set the packet kind to ensure the packet contents are
   visible to the kernel agent. AQL defines a doorbell signal mechanism to
   notify the kernel agent that the AQL queue has been updated. These rules, and
   the layout of the AQL queue and kernel dispatch packet, are defined in the
   *HSA System Architecture Specification* [HSA]_.
6. A kernel dispatch packet includes information about the actual dispatch,
   such as grid and work-group size, together with information from the code
   object about the kernel, such as segment sizes. The HSA compatible runtime
   queries on the kernel symbol can be used to obtain the code object values
   which are recorded in the :ref:`amdgpu-amdhsa-code-object-metadata`.
7. CP executes micro-code and is responsible for detecting and setting up the
   GPU to execute the wavefronts of a kernel dispatch.
8. CP ensures that when a wavefront starts executing the kernel machine
   code, the scalar general purpose registers (SGPR) and vector general purpose
   registers (VGPR) are set up as required by the machine code. The required
   setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
   register state is defined in
   :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
9. The prolog of the kernel machine code (see
   :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
   before continuing executing the machine code that corresponds to the kernel.
10. When the kernel dispatch has completed execution, CP signals the completion
    signal specified in the kernel dispatch packet if not 0.

.. _amdgpu-amdhsa-memory-spaces:

Memory Spaces
~~~~~~~~~~~~~

The memory space properties are:

  .. table:: AMDHSA Memory Spaces
     :name: amdgpu-amdhsa-memory-spaces-table

     ================= =========== ======== ======= ==================
     Memory Space Name HSA Segment Hardware Address NULL Value
                       Name        Name     Size
     ================= =========== ======== ======= ==================
     Private           private     scratch  32      0x00000000
     Local             group       LDS      32      0xFFFFFFFF
     Global            global      global   64      0x0000000000000000
     Constant          constant    *same as 64      0x0000000000000000
                                   global*
     Generic           flat        flat     64      0x0000000000000000
     Region            N/A         GDS      32      *not implemented
                                                    for AMDHSA*
     ================= =========== ======== ======= ==================

The global and constant memory spaces both use global virtual addresses, which
are the same virtual address space used by the CPU. However, some virtual
addresses may only be accessible to the CPU, some only accessible by the GPU,
and some by both.

Using the constant memory space indicates that the data will not change during
the execution of the kernel. This allows scalar read instructions to be
used. The vector and scalar L1 caches are invalidated of volatile data before
each kernel dispatch execution to allow constant memory to change values between
kernel dispatches.

The local memory space uses the hardware Local Data Store (LDS) which is
automatically allocated when the hardware creates work-groups of wavefronts, and
freed when all the wavefronts of a work-group have terminated. The data store
(DS) instructions can be used to access it.

The private memory space uses the hardware scratch memory support. If the kernel
uses scratch, then the hardware allocates memory that is accessed using
wavefront lane dword (4 byte) interleaving.
The mapping used from private
address to physical address is:

  ``wavefront-scratch-base +
  (private-address * wavefront-size * 4) +
  (wavefront-lane-id * 4)``

There are different ways that the wavefront scratch base address is determined
by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
memory can be accessed in an interleaved manner using buffer instructions with
the scratch buffer descriptor and per wavefront scratch offset, by the scratch
instructions, or by flat instructions. If each lane of a wavefront accesses the
same private address, the interleaving results in adjacent dwords being accessed
and hence requires fewer cache lines to be fetched. Multi-dword access is not
supported except by flat and scratch instructions in GFX9-GFX11.

The generic address space uses the hardware flat address support available in
GFX7-GFX11. This uses two fixed ranges of virtual addresses (the private and
local apertures), that are outside the range of addressable global memory, to
map from a flat address to a private or local address.

FLAT instructions can take a flat address and access global, private (scratch),
and group (LDS) memory depending on if the address is within one of the
aperture ranges. Flat access to scratch requires hardware aperture setup and
setup in the kernel prologue (see
:ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat access to LDS requires
hardware aperture setup and M0 (GFX7-GFX8) register setup (see
:ref:`amdgpu-amdhsa-kernel-prolog-m0`).

To convert between a segment address and a flat address the base address of the
corresponding aperture can be used. For GFX7-GFX8 these are available in the
:ref:`amdgpu-amdhsa-hsa-aql-queue`, the address of which can be obtained with
Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
GFX9-GFX11 the aperture base addresses are directly available as inline constant
registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64-bit
address mode the aperture sizes are 2^32 bytes and the base is aligned to 2^32,
which makes it easier to convert between flat and segment addresses.

Image and Samplers
~~~~~~~~~~~~~~~~~~

Image and sampler handles created by an HSA compatible runtime (see
:ref:`amdgpu-os`) are 64-bit addresses of a hardware 32-byte V# and 48-byte S#
object respectively. In order to support the HSA ``query_sampler`` operations
two extra dwords are used to store the HSA BRIG enumeration values for the
queries that are not trivially deducible from the S# representation.

HSA Signals
~~~~~~~~~~~

HSA signal handles created by an HSA compatible runtime (see :ref:`amdgpu-os`)
are 64-bit addresses of a structure allocated in memory accessible from both the
CPU and GPU. The structure is defined by the runtime and subject to change
between releases. For example, see [AMD-ROCm-github]_.

.. _amdgpu-amdhsa-hsa-aql-queue:

HSA AQL Queue
~~~~~~~~~~~~~

The HSA AQL queue structure is defined by an HSA compatible runtime (see
:ref:`amdgpu-os`) and subject to change between releases. For example, see
[AMD-ROCm-github]_. For some processors it contains fields needed to implement
certain language features such as the flat address aperture bases. It also
contains fields used by CP, such as those managing the allocation of scratch
memory.

.. _amdgpu-amdhsa-kernel-descriptor:

Kernel Descriptor
~~~~~~~~~~~~~~~~~

A kernel descriptor consists of the information needed by CP to initiate the
execution of a kernel, including the entry point address of the machine code
that implements the kernel.

Code Object V3 Kernel Descriptor
++++++++++++++++++++++++++++++++

CP microcode requires the kernel descriptor to be allocated on 64-byte
alignment.

The fields used by CP for code objects before V3 also match those specified in
:ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.

  .. table:: Code Object V3 Kernel Descriptor
     :name: amdgpu-amdhsa-kernel-descriptor-v3-table

     ======= ======= =============================== ============================
     Bits    Size    Field Name                      Description
     ======= ======= =============================== ============================
     31:0    4 bytes GROUP_SEGMENT_FIXED_SIZE        The amount of fixed local
                                                     address space memory
                                                     required for a work-group
                                                     in bytes. This does not
                                                     include any dynamically
                                                     allocated local address
                                                     space memory that may be
                                                     added when the kernel is
                                                     dispatched.
     63:32   4 bytes PRIVATE_SEGMENT_FIXED_SIZE      The amount of fixed
                                                     private address space
                                                     memory required for a
                                                     work-item in bytes.
                                                     Additional space may need to
                                                     be added to this value if
                                                     the call stack has
                                                     non-inlined function calls.
     95:64   4 bytes KERNARG_SIZE                    The size of the kernarg
                                                     memory pointed to by the
                                                     AQL dispatch packet. The
                                                     kernarg memory is used to
                                                     pass arguments to the
                                                     kernel.

                                                     * If the kernarg pointer in
                                                       the dispatch packet is NULL
                                                       then there are no kernel
                                                       arguments.
                                                     * If the kernarg pointer in
                                                       the dispatch packet is
                                                       not NULL and this value
                                                       is 0 then the kernarg
                                                       memory size is
                                                       unspecified.
                                                     * If the kernarg pointer in
                                                       the dispatch packet is
                                                       not NULL and this value
                                                       is not 0 then the value
                                                       specifies the kernarg
                                                       memory size in bytes. It
                                                       is recommended to provide
                                                       a value as it may be used
                                                       by CP to optimize making
                                                       the kernarg memory
                                                       visible to the kernel
                                                       code.

     127:96  4 bytes                                 Reserved, must be 0.
     191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET   Byte offset (possibly
                                                     negative) from base
                                                     address of kernel
                                                     descriptor to kernel's
                                                     entry point instruction
                                                     which must be 256 byte
                                                     aligned.
     351:192 20                                      Reserved, must be 0.
             bytes
     383:352 4 bytes COMPUTE_PGM_RSRC3               GFX6-GFX9
                                                       Reserved, must be 0.
                                                     GFX90A, GFX940
                                                       Compute Shader (CS)
                                                       program settings used by
                                                       CP to set up
                                                       ``COMPUTE_PGM_RSRC3``
                                                       configuration
                                                       register. See
                                                       :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
                                                     GFX10-GFX11
                                                       Compute Shader (CS)
                                                       program settings used by
                                                       CP to set up
                                                       ``COMPUTE_PGM_RSRC3``
                                                       configuration
                                                       register. See
                                                       :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-gfx11-table`.
     415:384 4 bytes COMPUTE_PGM_RSRC1               Compute Shader (CS)
                                                     program settings used by
                                                     CP to set up
                                                     ``COMPUTE_PGM_RSRC1``
                                                     configuration
                                                     register. See
                                                     :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
     447:416 4 bytes COMPUTE_PGM_RSRC2               Compute Shader (CS)
                                                     program settings used by
                                                     CP to set up
                                                     ``COMPUTE_PGM_RSRC2``
                                                     configuration
                                                     register. See
                                                     :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
     454:448 7 bits  *See separate bits below.*      Enable the setup of the
                                                     SGPR user data registers
                                                     (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     The total number of SGPR
                                                     user data registers
                                                     requested must not exceed
                                                     16 and match value in
                                                     ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
                                                     Any requests beyond 16
                                                     will be ignored.
     >448    1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     If the *Target Properties*
                     _BUFFER                         column of
                                                     :ref:`amdgpu-processor-table`
                                                     specifies *Architected flat
                                                     scratch* then not supported
                                                     and must be 0.
     >449    1 bit   ENABLE_SGPR_DISPATCH_PTR
     >450    1 bit   ENABLE_SGPR_QUEUE_PTR
     >451    1 bit   ENABLE_SGPR_KERNARG_SEGMENT_PTR
     >452    1 bit   ENABLE_SGPR_DISPATCH_ID
     >453    1 bit   ENABLE_SGPR_FLAT_SCRATCH_INIT   If the *Target Properties*
                                                     column of
                                                     :ref:`amdgpu-processor-table`
                                                     specifies *Architected flat
                                                     scratch* then not supported
                                                     and must be 0.
     >454    1 bit   ENABLE_SGPR_PRIVATE_SEGMENT
                     _SIZE
     457:455 3 bits                                  Reserved, must be 0.
     458     1 bit   ENABLE_WAVEFRONT_SIZE32         GFX6-GFX9
                                                       Reserved, must be 0.
                                                     GFX10-GFX11
                                                       - If 0 execute in
                                                         wavefront size 64 mode.
                                                       - If 1 execute in
                                                         native wavefront size
                                                         32 mode.
     459     1 bit   USES_DYNAMIC_STACK              Indicates if the generated
                                                     machine code is using a
                                                     dynamically sized stack.
     463:460 4 bits                                  Reserved, must be 0.
     464     1 bit   RESERVED_464                    Deprecated, must be 0.
     467:465 3 bits                                  Reserved, must be 0.
     468     1 bit   RESERVED_468                    Deprecated, must be 0.
     471:469 3 bits                                  Reserved, must be 0.
     511:472 5 bytes                                 Reserved, must be 0.
     512     **Total size 64 bytes.**
     ======= ====================================================================

..

  .. table:: compute_pgm_rsrc1 for GFX6-GFX11
     :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table

     ======= ======= =============================== ===========================================================================
     Bits    Size    Field Name                      Description
     ======= ======= =============================== ===========================================================================
     5:0     6 bits  GRANULATED_WORKITEM_VGPR_COUNT  Number of vector register
                                                     blocks used by each work-item;
                                                     granularity is device
                                                     specific:

                                                     GFX6-GFX9
                                                       - vgprs_used 0..256
                                                       - max(0, ceil(vgprs_used / 4) - 1)
                                                     GFX90A, GFX940
                                                       - vgprs_used 0..512
                                                       - vgprs_used = align(arch_vgprs, 4)
                                                         + acc_vgprs
                                                       - max(0, ceil(vgprs_used / 8) - 1)
                                                     GFX10-GFX11 (wavefront size 64)
                                                       - max_vgpr 1..256
                                                       - max(0, ceil(vgprs_used / 4) - 1)
                                                     GFX10-GFX11 (wavefront size 32)
                                                       - max_vgpr 1..256
                                                       - max(0, ceil(vgprs_used / 8) - 1)

                                                     Where vgprs_used is defined
                                                     as the highest VGPR number
                                                     explicitly referenced plus
                                                     one.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.VGPRS``.

                                                     The
                                                     :ref:`amdgpu-assembler`
                                                     calculates this
                                                     automatically for the
                                                     selected processor from
                                                     values provided to the
                                                     `.amdhsa_kernel` directive
                                                     by the
                                                     `.amdhsa_next_free_vgpr`
                                                     nested directive (see
                                                     :ref:`amdhsa-kernel-directives-table`).
     9:6     4 bits  GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar register
                                                     blocks used by a wavefront;
                                                     granularity is device
                                                     specific:

                                                     GFX6-GFX8
                                                       - sgprs_used 0..112
                                                       - max(0, ceil(sgprs_used / 8) - 1)
                                                     GFX9
                                                       - sgprs_used 0..112
                                                       - 2 * max(0, ceil(sgprs_used / 16) - 1)
                                                     GFX10-GFX11
                                                       Reserved, must be 0.
                                                       (128 SGPRs always
                                                       allocated.)

                                                     Where sgprs_used is
                                                     defined as the highest
                                                     SGPR number explicitly
                                                     referenced plus one, plus
                                                     a target specific number
                                                     of additional special
                                                     SGPRs for VCC,
                                                     FLAT_SCRATCH (GFX7+) and
                                                     XNACK_MASK (GFX8+), and
                                                     any additional
                                                     target specific
                                                     limitations. It does not
                                                     include the 16 SGPRs added
                                                     if a trap handler is
                                                     enabled.

                                                     The target specific
                                                     limitations and special
                                                     SGPR layout are defined in
                                                     the hardware
                                                     documentation, which can
                                                     be found in the
                                                     :ref:`amdgpu-processors`
                                                     table.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.SGPRS``.

                                                     The
                                                     :ref:`amdgpu-assembler`
                                                     calculates this
                                                     automatically for the
                                                     selected processor from
                                                     values provided to the
                                                     `.amdhsa_kernel` directive
                                                     by the
                                                     `.amdhsa_next_free_sgpr`
                                                     and `.amdhsa_reserve_*`
                                                     nested directives (see
                                                     :ref:`amdhsa-kernel-directives-table`).
     11:10   2 bits  PRIORITY                        Must be 0.

                                                     Start executing wavefront
                                                     at the specified priority.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.PRIORITY``.
     13:12   2 bits  FLOAT_ROUND_MODE_32             Wavefront starts execution
                                                     with specified rounding
                                                     mode for single (32
                                                     bit) floating point
                                                     precision floating point
                                                     operations.

                                                     Floating point rounding
                                                     mode values are defined in
                                                     :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     15:14   2 bits  FLOAT_ROUND_MODE_16_64          Wavefront starts execution
                                                     with specified rounding
                                                     mode for half/double (16
                                                     and 64-bit) floating point
                                                     precision floating point
                                                     operations.

                                                     Floating point rounding
                                                     mode values are defined in
                                                     :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     17:16   2 bits  FLOAT_DENORM_MODE_32            Wavefront starts execution
                                                     with specified denorm mode
                                                     for single (32
                                                     bit) floating point
                                                     precision floating point
                                                     operations.

                                                     Floating point denorm mode
                                                     values are defined in
                                                     :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     19:18   2 bits  FLOAT_DENORM_MODE_16_64         Wavefront starts execution
                                                     with specified denorm mode
                                                     for half/double (16
                                                     and 64-bit) floating point
                                                     precision floating point
                                                     operations.

                                                     Floating point denorm mode
                                                     values are defined in
                                                     :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     20      1 bit   PRIV                            Must be 0.

                                                     Start executing wavefront
                                                     in privilege trap handler
                                                     mode.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.PRIV``.
     21      1 bit   ENABLE_DX10_CLAMP               Wavefront starts execution
                                                     with DX10 clamp mode
                                                     enabled. Used by the vector
                                                     ALU to force DX10 style
                                                     treatment of NaN's (when
                                                     set, clamp NaN to zero,
                                                     otherwise pass NaN
                                                     through).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
     22      1 bit   DEBUG_MODE                      Must be 0.

                                                     Start executing wavefront
                                                     in single step mode.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
     23      1 bit   ENABLE_IEEE_MODE                Wavefront starts execution
                                                     with IEEE mode
                                                     enabled. Floating point
                                                     opcodes that support
                                                     exception flag gathering
                                                     will quiet and propagate
                                                     signaling-NaN inputs per
                                                     IEEE 754-2008. Min_dx10 and
                                                     max_dx10 become IEEE
                                                     754-2008 compliant due to
                                                     signaling-NaN propagation
                                                     and quieting.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
     24      1 bit   BULKY                           Must be 0.

                                                     Only one work-group allowed
                                                     to execute on a compute
                                                     unit.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.BULKY``.
     25      1 bit   CDBG_USER                       Must be 0.

                                                     Flag that can be used to
                                                     control debugging code.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.CDBG_USER``.
     26      1 bit   FP16_OVFL                       GFX6-GFX8
                                                       Reserved, must be 0.
                                                     GFX9-GFX11
                                                       Wavefront starts execution
                                                       with specified fp16 overflow
                                                       mode.

                                                       - If 0, fp16 overflow generates
                                                         +/-INF values.
                                                       - If 1, fp16 overflow that is the
                                                         result of an +/-INF input value
                                                         or divide by 0 produces a +/-INF,
                                                         otherwise clamps computed
                                                         overflow to +/-MAX_FP16 as
                                                         appropriate.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
     28:27   2 bits                                  Reserved, must be 0.
     29      1 bit   WGP_MODE                        GFX6-GFX9
                                                       Reserved, must be 0.
                                                     GFX10-GFX11
                                                       - If 0 execute work-groups in
                                                         CU wavefront execution mode.
                                                       - If 1 execute work-groups
                                                         in WGP wavefront execution mode.

                                                     See :ref:`amdgpu-amdhsa-memory-model`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.WGP_MODE``.
     30      1 bit   MEM_ORDERED                     GFX6-GFX9
                                                       Reserved, must be 0.
                                                     GFX10-GFX11
                                                       Controls the behavior of the
                                                       s_waitcnt's vmcnt and vscnt
                                                       counters.

                                                       - If 0 vmcnt reports completion
                                                         of load and atomic with return
                                                         out of order with sample
                                                         instructions, and the vscnt
                                                         reports the completion of
                                                         store and atomic without
                                                         return in order.
                                                       - If 1 vmcnt reports completion
                                                         of load, atomic with return
                                                         and sample instructions in
                                                         order, and the vscnt reports
                                                         the completion of store and
                                                         atomic without return in order.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.MEM_ORDERED``.
4292 31 1 bit FWD_PROGRESS GFX6-GFX9 4293 Reserved, must be 0. 4294 GFX10-GFX11 4295 - If 0 execute SIMD wavefronts 4296 using oldest first policy. 4297 - If 1 execute SIMD wavefronts to 4298 ensure wavefronts will make some 4299 forward progress. 4300 4301 Used by CP to set up 4302 ``COMPUTE_PGM_RSRC1.FWD_PROGRESS``. 4303 32 **Total size 4 bytes** 4304 ======= =================================================================================================================== 4305 4306.. 4307 4308 .. table:: compute_pgm_rsrc2 for GFX6-GFX11 4309 :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table 4310 4311 ======= ======= =============================== =========================================================================== 4312 Bits Size Field Name Description 4313 ======= ======= =============================== =========================================================================== 4314 0 1 bit ENABLE_PRIVATE_SEGMENT * Enable the setup of the 4315 private segment. 4316 * If the *Target Properties* 4317 column of 4318 :ref:`amdgpu-processor-table` 4319 does not specify 4320 *Architected flat 4321 scratch* then enable the 4322 setup of the SGPR 4323 wavefront scratch offset 4324 system register (see 4325 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 4326 * If the *Target Properties* 4327 column of 4328 :ref:`amdgpu-processor-table` 4329 specifies *Architected 4330 flat scratch* then enable 4331 the setup of the 4332 FLAT_SCRATCH register 4333 pair (see 4334 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 4335 4336 Used by CP to set up 4337 ``COMPUTE_PGM_RSRC2.SCRATCH_EN``. 4338 5:1 5 bits USER_SGPR_COUNT The total number of SGPR 4339 user data 4340 registers requested. This 4341 number must be greater than 4342 or equal to the number of user 4343 data registers enabled. 4344 4345 Used by CP to set up 4346 ``COMPUTE_PGM_RSRC2.USER_SGPR``. 4347 6 1 bit ENABLE_TRAP_HANDLER Must be 0. 
4348 4349 This bit represents 4350 ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``, 4351 which is set by the CP if 4352 the runtime has installed a 4353 trap handler. 4354 7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the 4355 system SGPR register for 4356 the work-group id in the X 4357 dimension (see 4358 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 4359 4360 Used by CP to set up 4361 ``COMPUTE_PGM_RSRC2.TGID_X_EN``. 4362 8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the 4363 system SGPR register for 4364 the work-group id in the Y 4365 dimension (see 4366 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 4367 4368 Used by CP to set up 4369 ``COMPUTE_PGM_RSRC2.TGID_Y_EN``. 4370 9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the 4371 system SGPR register for 4372 the work-group id in the Z 4373 dimension (see 4374 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 4375 4376 Used by CP to set up 4377 ``COMPUTE_PGM_RSRC2.TGID_Z_EN``. 4378 10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the 4379 system SGPR register for 4380 work-group information (see 4381 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 4382 4383 Used by CP to set up 4384 ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``. 4385 12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the 4386 VGPR system registers used 4387 for the work-item ID. 4388 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table` 4389 defines the values. 4390 4391 Used by CP to set up 4392 ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``. 4393 13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0. 4394 4395 Wavefront starts execution 4396 with address watch 4397 exceptions enabled which 4398 are generated when L1 has 4399 witnessed a thread access 4400 an *address of 4401 interest*. 4402 4403 CP is responsible for 4404 filling in the address 4405 watch bit in 4406 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB`` 4407 according to what the 4408 runtime requests. 4409 14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0. 

                                                     Wavefront starts execution
                                                     with memory violation
                                                     exceptions enabled which are
                                                     generated when a memory
                                                     violation has occurred for
                                                     this wavefront from L1 or LDS
                                                     (write-to-read-only-memory,
                                                     mis-aligned atomic, LDS
                                                     address out of range,
                                                     illegal address, etc.).

                                                     CP sets the memory
                                                     violation bit in
                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
                                                     according to what the
                                                     runtime requests.
     23:15   9 bits  GRANULATED_LDS_SIZE             Must be 0.

                                                     CP uses the rounded value
                                                     from the dispatch packet,
                                                     not this value, as the
                                                     dispatch may contain
                                                     dynamically allocated group
                                                     segment memory. CP writes
                                                     directly to
                                                     ``COMPUTE_PGM_RSRC2.LDS_SIZE``.

                                                     Amount of group segment
                                                     (LDS) to allocate for each
                                                     work-group. Granularity is
                                                     device specific:

                                                     GFX6
                                                       roundup(lds-size / (64 * 4))
                                                     GFX7-GFX11
                                                       roundup(lds-size / (128 * 4))

     24      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    Wavefront starts execution
                     _INVALID_OPERATION              with specified exceptions
                                                     enabled.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN``
                                                     (set from bits 0..6).

                                                     IEEE 754 FP Invalid
                                                     Operation
     25      1 bit   ENABLE_EXCEPTION_FP_DENORMAL    FP Denormal one or more
                     _SOURCE                         input operands is a
                                                     denormal number
     26      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Division by
                     _DIVISION_BY_ZERO               Zero
     27      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Overflow
                     _OVERFLOW
     28      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Underflow
                     _UNDERFLOW
     29      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Inexact
                     _INEXACT
     30      1 bit   ENABLE_EXCEPTION_INT_DIVIDE_BY  Integer Division by Zero
                     _ZERO                           (rcp_iflag_f32 instruction
                                                     only)
     31      1 bit                                   Reserved, must be 0.
     32                                              **Total size 4 bytes.**
     ======= ===================================================================================================================

..

  .. table:: compute_pgm_rsrc3 for GFX90A, GFX940
     :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table

     ======= ======= =============================== ===========================================================================
     Bits    Size    Field Name                      Description
     ======= ======= =============================== ===========================================================================
     5:0     6 bits  ACCUM_OFFSET                    Offset of a first AccVGPR in the unified register file. Granularity 4.
                                                     Value 0-63. 0 - accum-offset = 4, 1 - accum-offset = 8, ...,
                                                     63 - accum-offset = 256.
     15:6    10                                      Reserved, must be 0.
             bits
     16      1 bit   TG_SPLIT                        - If 0 the waves of a work-group are
                                                       launched in the same CU.
                                                     - If 1 the waves of a work-group can be
                                                       launched in different CUs. The waves
                                                       cannot use S_BARRIER or LDS.
     31:17   15                                      Reserved, must be 0.
             bits
     32                                              **Total size 4 bytes.**
     ======= ===================================================================================================================

..

  .. table:: compute_pgm_rsrc3 for GFX10-GFX11
     :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-gfx11-table

     ======= ======= =============================== ===========================================================================
     Bits    Size    Field Name                      Description
     ======= ======= =============================== ===========================================================================
     3:0     4 bits  SHARED_VGPR_COUNT               Number of shared VGPR blocks when executing in subvector mode. For
                                                     wavefront size 64 the value is 0-15, representing 0-120 VGPRs (granularity
                                                     of 8), such that (compute_pgm_rsrc1.vgprs +1)*4 + shared_vgpr_count*8 does
                                                     not exceed 256. For wavefront size 32 shared_vgpr_count must be 0.
     9:4     6 bits  INST_PREF_SIZE                  GFX10
                                                       Reserved, must be 0.
                                                     GFX11
                                                       Number of instruction bytes to prefetch, starting at the kernel's entry
                                                       point instruction, before wavefront starts execution. The value is 0..63
                                                       with a granularity of 128 bytes.
     10      1 bit   TRAP_ON_START                   GFX10
                                                       Reserved, must be 0.
                                                     GFX11
                                                       Must be 0.

                                                       If 1, wavefront starts execution by trapping into the trap handler.

                                                       CP is responsible for filling in the trap on start bit in
                                                       ``COMPUTE_PGM_RSRC3.TRAP_ON_START`` according to what the runtime
                                                       requests.
     11      1 bit   TRAP_ON_END                     GFX10
                                                       Reserved, must be 0.
                                                     GFX11
                                                       Must be 0.

                                                       If 1, wavefront execution terminates by trapping into the trap handler.

                                                       CP is responsible for filling in the trap on end bit in
                                                       ``COMPUTE_PGM_RSRC3.TRAP_ON_END`` according to what the runtime requests.
     30:12   19 bits                                 Reserved, must be 0.
     31      1 bit   IMAGE_OP                        GFX10
                                                       Reserved, must be 0.
                                                     GFX11
                                                       If 1, the kernel execution contains image instructions. If executed as
                                                       part of a graphics pipeline, image read instructions will stall waiting
                                                       for any necessary ``WAIT_SYNC`` fence to be performed in order to
                                                       indicate that earlier pipeline stages have completed writing to the
                                                       image.

                                                       Not used for compute kernels that are not part of a graphics pipeline
                                                       and must be 0.
     32                                              **Total size 4 bytes.**
     ======= ===================================================================================================================

..

  .. table:: Floating Point Rounding Mode Enumeration Values
     :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table

     ====================================== ===== ==============================
     Enumeration Name                       Value Description
     ====================================== ===== ==============================
     FLOAT_ROUND_MODE_NEAR_EVEN             0     Round Ties To Even
     FLOAT_ROUND_MODE_PLUS_INFINITY         1     Round Toward +infinity
     FLOAT_ROUND_MODE_MINUS_INFINITY        2     Round Toward -infinity
     FLOAT_ROUND_MODE_ZERO                  3     Round Toward 0
     ====================================== ===== ==============================

..

  .. table:: Floating Point Denorm Mode Enumeration Values
     :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table

     ====================================== ===== ==============================
     Enumeration Name                       Value Description
     ====================================== ===== ==============================
     FLOAT_DENORM_MODE_FLUSH_SRC_DST        0     Flush Source and Destination
                                                  Denorms
     FLOAT_DENORM_MODE_FLUSH_DST            1     Flush Output Denorms
     FLOAT_DENORM_MODE_FLUSH_SRC            2     Flush Source Denorms
     FLOAT_DENORM_MODE_FLUSH_NONE           3     No Flush
     ====================================== ===== ==============================

..

  .. table:: System VGPR Work-Item ID Enumeration Values
     :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table

     ======================================== ===== ============================
     Enumeration Name                         Value Description
     ======================================== ===== ============================
     SYSTEM_VGPR_WORKITEM_ID_X                0     Set work-item X dimension
                                                    ID.
     SYSTEM_VGPR_WORKITEM_ID_X_Y              1     Set work-item X and Y
                                                    dimensions ID.
     SYSTEM_VGPR_WORKITEM_ID_X_Y_Z            2     Set work-item X, Y and Z
                                                    dimensions ID.
     SYSTEM_VGPR_WORKITEM_ID_UNDEFINED        3     Undefined.
     ======================================== ===== ============================

.. _amdgpu-amdhsa-initial-kernel-execution-state:

Initial Kernel Execution State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This section defines the register state that will be set up by the packet
processor prior to the start of execution of every wavefront. This is limited by
the constraints of the hardware controllers of CP/ADC/SPI.

The order of the SGPR registers is defined, but the compiler can specify which
ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
for enabled registers are dense starting at SGPR0: the first enabled register is
SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
an SGPR number.

The initial SGPRs comprise up to 16 User SGPRs that are set by CP and apply to
all wavefronts of the grid. It is possible to specify more than 16 User SGPRs
using the ``enable_sgpr_*`` bit fields, in which case only the first 16 are
actually initialized. These are then immediately followed by the System SGPRs
that are set up by ADC/SPI and can have different values for each wavefront of
the grid dispatch.

SGPR register initial state is defined in
:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

  .. table:: SGPR Register Set Up Order
     :name: amdgpu-amdhsa-sgpr-register-set-up-order-table

     ========== ========================== ====== ==============================
     SGPR Order Name                       Number Description
                (kernel descriptor enable  of
                field)                     SGPRs
     ========== ========================== ====== ==============================
     First      Private Segment Buffer     4      See
                (enable_sgpr_private              :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
4634 _segment_buffer) 4635 then Dispatch Ptr 2 64-bit address of AQL dispatch 4636 (enable_sgpr_dispatch_ptr) packet for kernel dispatch 4637 actually executing. 4638 then Queue Ptr 2 64-bit address of amd_queue_t 4639 (enable_sgpr_queue_ptr) object for AQL queue on which 4640 the dispatch packet was 4641 queued. 4642 then Kernarg Segment Ptr 2 64-bit address of Kernarg 4643 (enable_sgpr_kernarg segment. This is directly 4644 _segment_ptr) copied from the 4645 kernarg_address in the kernel 4646 dispatch packet. 4647 4648 Having CP load it once avoids 4649 loading it at the beginning of 4650 every wavefront. 4651 then Dispatch Id 2 64-bit Dispatch ID of the 4652 (enable_sgpr_dispatch_id) dispatch packet being 4653 executed. 4654 then Flat Scratch Init 2 See 4655 (enable_sgpr_flat_scratch :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`. 4656 _init) 4657 then Private Segment Size 1 The 32-bit byte size of a 4658 (enable_sgpr_private single work-item's memory 4659 _segment_size) allocation. This is the 4660 value from the kernel 4661 dispatch packet Private 4662 Segment Byte Size rounded up 4663 by CP to a multiple of 4664 DWORD. 4665 4666 Having CP load it once avoids 4667 loading it at the beginning of 4668 every wavefront. 4669 4670 This is not used for 4671 GFX7-GFX8 since it is the same 4672 value as the second SGPR of 4673 Flat Scratch Init. However, it 4674 may be needed for GFX9-GFX11 which 4675 changes the meaning of the 4676 Flat Scratch Init value. 4677 then Work-Group Id X 1 32-bit work-group id in X 4678 (enable_sgpr_workgroup_id dimension of grid for 4679 _X) wavefront. 4680 then Work-Group Id Y 1 32-bit work-group id in Y 4681 (enable_sgpr_workgroup_id dimension of grid for 4682 _Y) wavefront. 4683 then Work-Group Id Z 1 32-bit work-group id in Z 4684 (enable_sgpr_workgroup_id dimension of grid for 4685 _Z) wavefront. 
4686 then Work-Group Info 1 {first_wavefront, 14'b0000, 4687 (enable_sgpr_workgroup ordered_append_term[10:0], 4688 _info) threadgroup_size_in_wavefronts[5:0]} 4689 then Scratch Wavefront Offset 1 See 4690 (enable_sgpr_private :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`. 4691 _segment_wavefront_offset) and 4692 :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`. 4693 ========== ========================== ====== ============================== 4694 4695The order of the VGPR registers is defined, but the compiler can specify which 4696ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit 4697fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used 4698for enabled registers are dense starting at VGPR0: the first enabled register is 4699VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a 4700VGPR number. 4701 4702There are different methods used for the VGPR initial state: 4703 4704* Unless the *Target Properties* column of :ref:`amdgpu-processor-table` 4705 specifies otherwise, a separate VGPR register is used per work-item ID. The 4706 VGPR register initial state for this method is defined in 4707 :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table`. 4708* If *Target Properties* column of :ref:`amdgpu-processor-table` 4709 specifies *Packed work-item IDs*, the initial value of VGPR0 register is used 4710 for all work-item IDs. The register layout for this method is defined in 4711 :ref:`amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table`. 4712 4713 .. 
table:: VGPR Register Set Up Order for Unpacked Work-Item ID Method 4714 :name: amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table 4715 4716 ========== ========================== ====== ============================== 4717 VGPR Order Name Number Description 4718 (kernel descriptor enable of 4719 field) VGPRs 4720 ========== ========================== ====== ============================== 4721 First Work-Item Id X 1 32-bit work-item id in X 4722 (Always initialized) dimension of work-group for 4723 wavefront lane. 4724 then Work-Item Id Y 1 32-bit work-item id in Y 4725 (enable_vgpr_workitem_id dimension of work-group for 4726 > 0) wavefront lane. 4727 then Work-Item Id Z 1 32-bit work-item id in Z 4728 (enable_vgpr_workitem_id dimension of work-group for 4729 > 1) wavefront lane. 4730 ========== ========================== ====== ============================== 4731 4732.. 4733 4734 .. table:: Register Layout for Packed Work-Item ID Method 4735 :name: amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table 4736 4737 ======= ======= ================ ========================================= 4738 Bits Size Field Name Description 4739 ======= ======= ================ ========================================= 4740 0:9 10 bits Work-Item Id X Work-item id in X 4741 dimension of work-group for 4742 wavefront lane. 4743 4744 Always initialized. 4745 4746 10:19 10 bits Work-Item Id Y Work-item id in Y 4747 dimension of work-group for 4748 wavefront lane. 4749 4750 Initialized if enable_vgpr_workitem_id > 4751 0, otherwise set to 0. 4752 20:29 10 bits Work-Item Id Z Work-item id in Z 4753 dimension of work-group for 4754 wavefront lane. 4755 4756 Initialized if enable_vgpr_workitem_id > 4757 1, otherwise set to 0. 4758 30:31 2 bits Reserved, set to 0. 4759 ======= ======= ================ ========================================= 4760 4761The setting of registers is done by GPU CP/ADC/SPI hardware as follows: 4762 47631. 
SGPRs before the Work-Group Ids are set by CP using the 16 User Data 4764 registers. 47652. Work-group Id registers X, Y, Z are set by ADC which supports any 4766 combination including none. 47673. Scratch Wavefront Offset is set by SPI in a per wavefront basis which is why 4768 its value cannot be included with the flat scratch init value which is per 4769 queue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). 47704. The VGPRs are set by SPI which only supports specifying either (X), (X, Y) 4771 or (X, Y, Z). 47725. Flat Scratch register pair initialization is described in 4773 :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`. 4774 4775The global segment can be accessed either using buffer instructions (GFX6 which 4776has V# 64-bit address support), flat instructions (GFX7-GFX11), or global 4777instructions (GFX9-GFX11). 4778 4779If buffer operations are used, then the compiler can generate a V# with the 4780following properties: 4781 4782* base address of 0 4783* no swizzle 4784* ATC: 1 if IOMMU present (such as APU) 4785* ptr64: 1 4786* MTYPE set to support memory coherence that matches the runtime (such as CC for 4787 APU and NC for dGPU). 4788 4789.. _amdgpu-amdhsa-kernel-prolog: 4790 4791Kernel Prolog 4792~~~~~~~~~~~~~ 4793 4794The compiler performs initialization in the kernel prologue depending on the 4795target and information about things like stack usage in the kernel and called 4796functions. Some of this initialization requires the compiler to request certain 4797User and System SGPRs be present in the 4798:ref:`amdgpu-amdhsa-initial-kernel-execution-state` via the 4799:ref:`amdgpu-amdhsa-kernel-descriptor`. 4800 4801.. _amdgpu-amdhsa-kernel-prolog-cfi: 4802 4803CFI 4804+++ 4805 48061. The CFI return address is undefined. 4807 48082. The CFI CFA is defined using an expression which evaluates to a location 4809 description that comprises one memory location description for the 4810 ``DW_ASPACE_AMDGPU_private_lane`` address space address ``0``. 
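The dense SGPR numbering described in
:ref:`amdgpu-amdhsa-initial-kernel-execution-state` above can be illustrated
with a small host-side sketch. This is ours, not part of the ABI: the function
and flag names are hypothetical, and the register names and sizes follow the
SGPR Register Set Up Order table.

```python
# Illustrative sketch (not normative): compute the dense SGPR numbering from
# the set of enabled kernel descriptor fields. Disabled registers consume no
# SGPR numbers, so each enabled register starts where the previous one ended.
SGPR_ORDER = [
    ("private_segment_buffer", 4),
    ("dispatch_ptr", 2),
    ("queue_ptr", 2),
    ("kernarg_segment_ptr", 2),
    ("dispatch_id", 2),
    ("flat_scratch_init", 2),
    ("private_segment_size", 1),
    ("workgroup_id_x", 1),
    ("workgroup_id_y", 1),
    ("workgroup_id_z", 1),
    ("workgroup_info", 1),
    ("private_segment_wavefront_offset", 1),
]

def assign_sgprs(enabled):
    """Map each enabled register to its first SGPR number."""
    assignment, next_sgpr = {}, 0
    for name, count in SGPR_ORDER:
        if name in enabled:
            assignment[name] = next_sgpr
            next_sgpr += count
    return assignment
```

For example, enabling only Dispatch Ptr, Kernarg Segment Ptr, and Work-Group
Id X places them at SGPR0-1, SGPR2-3, and SGPR4 respectively.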
4811 4812.. _amdgpu-amdhsa-kernel-prolog-m0: 4813 4814M0 4815++ 4816 4817GFX6-GFX8 4818 The M0 register must be initialized with a value at least the total LDS size 4819 if the kernel may access LDS via DS or flat operations. Total LDS size is 4820 available in dispatch packet. For M0, it is also possible to use maximum 4821 possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for 4822 GFX7-GFX8). 4823GFX9-GFX11 4824 The M0 register is not used for range checking LDS accesses and so does not 4825 need to be initialized in the prolog. 4826 4827.. _amdgpu-amdhsa-kernel-prolog-stack-pointer: 4828 4829Stack Pointer 4830+++++++++++++ 4831 4832If the kernel has function calls it must set up the ABI stack pointer described 4833in :ref:`amdgpu-amdhsa-function-call-convention-non-kernel-functions` by setting 4834SGPR32 to the unswizzled scratch offset of the address past the last local 4835allocation. 4836 4837.. _amdgpu-amdhsa-kernel-prolog-frame-pointer: 4838 4839Frame Pointer 4840+++++++++++++ 4841 4842If the kernel needs a frame pointer for the reasons defined in 4843``SIFrameLowering`` then SGPR33 is used and is always set to ``0`` in the 4844kernel prolog. If a frame pointer is not required then all uses of the frame 4845pointer are replaced with immediate ``0`` offsets. 4846 4847.. _amdgpu-amdhsa-kernel-prolog-flat-scratch: 4848 4849Flat Scratch 4850++++++++++++ 4851 4852There are different methods used for initializing flat scratch: 4853 4854* If the *Target Properties* column of :ref:`amdgpu-processor-table` 4855 specifies *Does not support generic address space*: 4856 4857 Flat scratch is not supported and there is no flat scratch register pair. 

* If the *Target Properties* column of :ref:`amdgpu-processor-table`
  specifies *Offset flat scratch*:

  If the kernel or any function it calls may use flat operations to access
  scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
  (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI). Initialization uses Flat Scratch Init and
  Scratch Wavefront Offset SGPR registers (see
  :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):

  1. The low word of Flat Scratch Init is the 32-bit byte offset from
     ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
     being managed by SPI for the queue executing the kernel dispatch. This is
     the same value used in the Scratch Segment Buffer V# base address.

     CP obtains this from the runtime. (The Scratch Segment Buffer base address
     is ``SH_HIDDEN_PRIVATE_BASE_VIMID`` plus this offset.)

     The prolog must add the value of Scratch Wavefront Offset to get the
     wavefront's byte scratch backing memory offset from
     ``SH_HIDDEN_PRIVATE_BASE_VIMID``.

     The Scratch Wavefront Offset must also be used as an offset with Private
     segment address when using the Scratch Segment Buffer.

     Since FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right
     shifted by 8 before moving into FLAT_SCRATCH_HI.

     FLAT_SCRATCH_HI corresponds to SGPRn-4 on GFX7, and SGPRn-6 on GFX8 (where
     SGPRn is the highest numbered SGPR allocated to the wavefront).
     FLAT_SCRATCH_HI is multiplied by 256 (as it is in units of 256 bytes) and
     added to ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to calculate the per wavefront
     FLAT SCRATCH BASE in flat memory instructions that access the scratch
     aperture.
  2. The second word of Flat Scratch Init is the 32-bit byte size of a single
     work-item's scratch memory usage.

     CP obtains this from the runtime, and it is always a multiple of DWORD. CP
     checks that the value in the kernel dispatch packet Private Segment Byte
     Size is not larger and requests the runtime to increase the queue's
     scratch size if necessary.

     CP directly loads from the kernel dispatch packet Private Segment Byte
     Size field and rounds up to a multiple of DWORD. Having CP load it once
     avoids loading it at the beginning of every wavefront.

     The kernel prolog code must move it to FLAT_SCRATCH_LO which is SGPRn-3 on
     GFX7 and SGPRn-5 on GFX8. FLAT_SCRATCH_LO is used as the FLAT SCRATCH SIZE
     in flat memory instructions.

* If the *Target Properties* column of :ref:`amdgpu-processor-table`
  specifies *Absolute flat scratch*:

  If the kernel or any function it calls may use flat operations to access
  scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
  (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which are in SGPRn-4/SGPRn-3). Initialization
  uses Flat Scratch Init and Scratch Wavefront Offset SGPR registers (see
  :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):

  The Flat Scratch Init is the 64-bit address of the base of scratch backing
  memory being managed by SPI for the queue executing the kernel dispatch.

  CP obtains this from the runtime.

  The kernel prolog must add the value of the wave's Scratch Wavefront Offset
  and move the result as a 64-bit value to the FLAT_SCRATCH SGPR register pair
  which is SGPRn-6 and SGPRn-5. It is used as the FLAT SCRATCH BASE in flat
  memory instructions.

  The Scratch Wavefront Offset must also be used as an offset with Private
  segment address when using the Scratch Segment Buffer (see
  :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`).
4930 4931* If the *Target Properties* column of :ref:`amdgpu-processor-table` 4932 specifies *Architected flat scratch*: 4933 4934 If ENABLE_PRIVATE_SEGMENT is enabled in 4935 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table` then the FLAT_SCRATCH 4936 register pair will be initialized to the 64-bit address of the base of scratch 4937 backing memory being managed by SPI for the queue executing the kernel 4938 dispatch plus the value of the wave's Scratch Wavefront Offset for use as the 4939 flat scratch base in flat memory instructions. 4940 4941.. _amdgpu-amdhsa-kernel-prolog-private-segment-buffer: 4942 4943Private Segment Buffer 4944++++++++++++++++++++++ 4945 4946If the *Target Properties* column of :ref:`amdgpu-processor-table` specifies 4947*Architected flat scratch* then a Private Segment Buffer is not supported. 4948Instead the flat SCRATCH instructions are used. 4949 4950Otherwise, Private Segment Buffer SGPR register is used to initialize 4 SGPRs 4951that are used as a V# to access scratch. CP uses the value provided by the 4952runtime. It is used, together with Scratch Wavefront Offset as an offset, to 4953access the private memory space using a segment address. See 4954:ref:`amdgpu-amdhsa-initial-kernel-execution-state`. 4955 4956The scratch V# is a four-aligned SGPR and always selected for the kernel as 4957follows: 4958 4959 - If it is known during instruction selection that there is stack usage, 4960 SGPR0-3 is reserved for use as the scratch V#. Stack usage is assumed if 4961 optimizations are disabled (``-O0``), if stack objects already exist (for 4962 locals, etc.), or if there are any function calls. 4963 4964 - Otherwise, four high numbered SGPRs beginning at a four-aligned SGPR index 4965 are reserved for the tentative scratch V#. These will be used if it is 4966 determined that spilling is needed. 4967 4968 - If no use is made of the tentative scratch V#, then it is unreserved, 4969 and the register count is determined ignoring it. 
    - If use is made of the tentative scratch V#, then its register numbers
      are shifted to the first four-aligned SGPR index after the highest one
      allocated by the register allocator, and all uses are updated. The
      register count includes them in the shifted location.
    - In either case, if the processor has the SGPR allocation bug, the
      tentative allocation is not shifted or unreserved in order to ensure
      the register count is higher to work around the bug.

    .. note::

       This approach of using a tentative scratch V# and shifting the register
       numbers if used avoids having to perform register allocation a second
       time if the tentative V# is eliminated. This is more efficient and
       avoids the problem that the second register allocation may perform
       spilling which will fail as there is no longer a scratch V#.

When the kernel prolog code is being emitted it is known whether the scratch V#
described above is actually used. If it is, the prolog code must set it up by
copying the Private Segment Buffer to the scratch V# registers and then adding
the Private Segment Wavefront Offset to the queue base address in the V#. The
result is a V# with a base address pointing to the beginning of the wavefront
scratch backing memory.

The Private Segment Buffer is always requested, but the Private Segment
Wavefront Offset is only requested if it is used (see
:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

.. _amdgpu-amdhsa-memory-model:

Memory Model
~~~~~~~~~~~~

This section describes the mapping of the LLVM memory model onto AMDGPU machine
code (see :ref:`memmodel`).

The AMDGPU backend supports the memory synchronization scopes specified in
:ref:`amdgpu-memory-scopes`.

The code sequences used to implement the memory model specify the order of
instructions that a single thread must execute. The ``s_waitcnt`` and cache
management instructions such as ``buffer_wbinvl1_vol`` are defined with respect
to other memory instructions executed by the same thread. This allows them to be
moved earlier or later which can allow them to be combined with other instances
of the same instruction, or hoisted/sunk out of loops to improve performance.
Only the instructions related to the memory model are given; additional
``s_waitcnt`` instructions are required to ensure registers are defined before
being used. These may be combinable with the memory model ``s_waitcnt``
instructions as described above.

The AMDGPU backend supports the following memory models:

  HSA Memory Model [HSA]_
    The HSA memory model uses a single happens-before relation for all address
    spaces (see :ref:`amdgpu-address-spaces`).
  OpenCL Memory Model [OpenCL]_
    The OpenCL memory model which has separate happens-before relations for the
    global and local address spaces. Only a fence specifying both global and
    local address space, and seq_cst instructions join the relationships. Since
    the LLVM ``fence`` instruction does not allow an address space to be
    specified the OpenCL fence has to conservatively assume both local and
    global address space was specified. However, optimizations can often be
    done to eliminate the additional ``s_waitcnt`` instructions when there are
    no intervening memory instructions which access the corresponding address
    space. The code sequences in the table indicate what can be omitted for the
    OpenCL memory model. The target triple environment is used to determine if
    the source language is OpenCL (see :ref:`amdgpu-opencl`).

``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
operations.

``buffer/global/flat_load/store/atomic`` instructions to global memory are
termed vector memory operations.
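The LDS versus vector memory operation terminology above can be summarized in
a small illustrative helper. This is our sketch, not part of the backend: the
function name is hypothetical, and only the mnemonic-prefix rules stated in
the two paragraphs above are encoded.

```python
def classify_memory_op(mnemonic: str, to_lds: bool) -> str:
    """Classify an instruction per the terminology above:
    ds/flat instructions to local memory are LDS operations;
    buffer/global/flat instructions to global memory are vector
    memory operations."""
    lds_prefixes = ("ds_", "flat_")
    vector_prefixes = ("buffer_", "global_", "flat_")
    if to_lds and mnemonic.startswith(lds_prefixes):
        return "LDS operation"
    if not to_lds and mnemonic.startswith(vector_prefixes):
        return "vector memory operation"
    return "other"
```

Note that a ``flat_`` instruction falls in either category depending on
whether the address it accesses resolves to local or global memory.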

Private address space uses ``buffer_load/store`` with the scratch V#
(GFX6-GFX8), or ``scratch_load/store`` (GFX9-GFX11). Since only a single thread
is accessing the memory, atomic memory orderings are not meaningful, and all
accesses are treated as non-atomic.

Constant address space uses ``buffer/global_load`` instructions (or equivalent
scalar memory instructions). Since the constant address space contents do not
change during the execution of a kernel dispatch it is not legal to perform
stores, and atomic memory orderings are not meaningful, and all accesses are
treated as non-atomic.

A memory synchronization scope wider than work-group is not meaningful for the
group (LDS) address space and is treated as work-group.

The memory model does not support the region address space which is treated as
non-atomic.

Acquire memory ordering is not meaningful on store atomic instructions and is
treated as non-atomic.

Release memory ordering is not meaningful on load atomic instructions and is
treated as non-atomic.

Acquire-release memory ordering is not meaningful on load or store atomic
instructions and is treated as acquire and release respectively.

The memory order also adds the single thread optimization constraints defined in
table
:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table`.

  .. table:: AMDHSA Memory Model Single Thread Optimization Constraints
     :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table

     ============ ==============================================================
     LLVM Memory  Optimization Constraints
     Ordering
     ============ ==============================================================
     unordered    *none*
     monotonic    *none*
     acquire      - If a load atomic/atomicrmw then no following load/load
                    atomic/store/store atomic/atomicrmw/fence instruction can be
                    moved before the acquire.
                  - If a fence then same as load atomic, plus no preceding
                    associated fence-paired-atomic can be moved after the fence.
     release      - If a store atomic/atomicrmw then no preceding load/load
                    atomic/store/store atomic/atomicrmw/fence instruction can be
                    moved after the release.
                  - If a fence then same as store atomic, plus no following
                    associated fence-paired-atomic can be moved before the
                    fence.
     acq_rel      Same constraints as both acquire and release.
     seq_cst      - If a load atomic then same constraints as acquire, plus no
                    preceding sequentially consistent load atomic/store
                    atomic/atomicrmw/fence instruction can be moved after the
                    seq_cst.
                  - If a store atomic then the same constraints as release, plus
                    no following sequentially consistent load atomic/store
                    atomic/atomicrmw/fence instruction can be moved before the
                    seq_cst.
                  - If an atomicrmw/fence then same constraints as acq_rel.
     ============ ==============================================================

The code sequences used to implement the memory model are defined in the
following sections:

* :ref:`amdgpu-amdhsa-memory-model-gfx6-gfx9`
* :ref:`amdgpu-amdhsa-memory-model-gfx90a`
* :ref:`amdgpu-amdhsa-memory-model-gfx940`
* :ref:`amdgpu-amdhsa-memory-model-gfx10-gfx11`

.. _amdgpu-amdhsa-memory-model-gfx6-gfx9:

Memory Model GFX6-GFX9
++++++++++++++++++++++

For GFX6-GFX9:

* Each agent has multiple shader arrays (SA).
* Each SA has multiple compute units (CU).
* Each CU has multiple SIMDs that execute wavefronts.
* The wavefronts for a single work-group are executed in the same CU but may be
  executed by different SIMDs.
* Each CU has a single LDS memory shared by the wavefronts of the work-groups
  executing on it.
* All LDS operations of a CU are performed as wavefront wide operations in a
  global order and involve no caching. Completion is reported to a wavefront in
  execution order.
* The LDS memory has multiple request queues shared by the SIMDs of a
  CU. Therefore, the LDS operations performed by different wavefronts of a
  work-group can be reordered relative to each other, which can result in
  reordering the visibility of vector memory operations with respect to LDS
  operations of other wavefronts in the same work-group. A ``s_waitcnt
  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
  vector memory operations between wavefronts of a work-group, but not between
  operations performed by the same wavefront.
* The vector memory operations are performed as wavefront wide operations and
  completion is reported to a wavefront in execution order. The exception is
  that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
  vector memory order if they access LDS memory, and out of LDS operation order
  if they access global memory.
* The vector memory operations access a single vector L1 cache shared by all
  SIMDs of a CU. Therefore, no special action is required for coherence between
  the lanes of a single wavefront, or for coherence between wavefronts in the
  same work-group. A ``buffer_wbinvl1_vol`` is required for coherence between
  wavefronts executing in different work-groups as they may be executing on
  different CUs.
* The scalar memory operations access a scalar L1 cache shared by all wavefronts
  on a group of CUs. The scalar and vector L1 caches are not coherent. However,
  scalar operations are used in a restricted way so do not impact the memory
  model. See :ref:`amdgpu-amdhsa-memory-spaces`.
* The vector and scalar memory operations use an L2 cache shared by all CUs on
  the same agent.
* The L2 cache has independent channels to service disjoint ranges of virtual
  addresses.
* Each CU has a separate request queue per channel. Therefore, the vector and
  scalar memory operations performed by wavefronts executing in different
  work-groups (which may be executing on different CUs) of an agent can be
  reordered relative to each other. A ``s_waitcnt vmcnt(0)`` is required to
  ensure synchronization between vector memory operations of different CUs. It
  ensures a previous vector memory operation has completed before executing a
  subsequent vector memory or LDS operation and so can be used to meet the
  requirements of acquire and release.
* The L2 cache can be kept coherent with other agents on some targets, or ranges
  of virtual addresses can be set up to bypass it to ensure system coherence.

Scalar memory operations are only used to access memory that is proven to not
change during the execution of the kernel dispatch. This includes constant
address space and global address space for program scope ``const`` variables.
Therefore, the kernel machine code does not have to maintain the scalar cache to
ensure it is coherent with the vector caches. The scalar and vector caches are
invalidated between kernel dispatches by CP since constant address space data
may change between kernel dispatch executions.
See 5175:ref:`amdgpu-amdhsa-memory-spaces`. 5176 5177The one exception is if scalar writes are used to spill SGPR registers. In this 5178case the AMDGPU backend ensures the memory location used to spill is never 5179accessed by vector memory operations at the same time. If scalar writes are used 5180then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function 5181return since the locations may be used for vector memory instructions by a 5182future wavefront that uses the same scratch area, or a function call that 5183creates a frame at the same address, respectively. There is no need for a 5184``s_dcache_inv`` as all scalar writes are write-before-read in the same thread. 5185 5186For kernarg backing memory: 5187 5188* CP invalidates the L1 cache at the start of each kernel dispatch. 5189* On dGPU the kernarg backing memory is allocated in host memory accessed as 5190 MTYPE UC (uncached) to avoid needing to invalidate the L2 cache. This also 5191 causes it to be treated as non-volatile and so is not invalidated by 5192 ``*_vol``. 5193* On APU the kernarg backing memory it is accessed as MTYPE CC (cache coherent) 5194 and so the L2 cache will be coherent with the CPU and other agents. 5195 5196Scratch backing memory (which is used for the private address space) is accessed 5197with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is 5198only accessed by a single thread, and is always write-before-read, there is 5199never a need to invalidate these entries from the L1 cache. Hence all cache 5200invalidates are done as ``*_vol`` to only invalidate the volatile cache lines. 5201 5202The code sequences used to implement the memory model for GFX6-GFX9 are defined 5203in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`. 5204 5205 .. 
table:: AMDHSA Memory Model Code Sequences GFX6-GFX9 5206 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table 5207 5208 ============ ============ ============== ========== ================================ 5209 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code 5210 Ordering Sync Scope Address GFX6-GFX9 5211 Space 5212 ============ ============ ============== ========== ================================ 5213 **Non-Atomic** 5214 ------------------------------------------------------------------------------------ 5215 load *none* *none* - global - !volatile & !nontemporal 5216 - generic 5217 - private 1. buffer/global/flat_load 5218 - constant 5219 - !volatile & nontemporal 5220 5221 1. buffer/global/flat_load 5222 glc=1 slc=1 5223 5224 - volatile 5225 5226 1. buffer/global/flat_load 5227 glc=1 5228 2. s_waitcnt vmcnt(0) 5229 5230 - Must happen before 5231 any following volatile 5232 global/generic 5233 load/store. 5234 - Ensures that 5235 volatile 5236 operations to 5237 different 5238 addresses will not 5239 be reordered by 5240 hardware. 5241 5242 load *none* *none* - local 1. ds_load 5243 store *none* *none* - global - !volatile & !nontemporal 5244 - generic 5245 - private 1. buffer/global/flat_store 5246 - constant 5247 - !volatile & nontemporal 5248 5249 1. buffer/global/flat_store 5250 glc=1 slc=1 5251 5252 - volatile 5253 5254 1. buffer/global/flat_store 5255 2. s_waitcnt vmcnt(0) 5256 5257 - Must happen before 5258 any following volatile 5259 global/generic 5260 load/store. 5261 - Ensures that 5262 volatile 5263 operations to 5264 different 5265 addresses will not 5266 be reordered by 5267 hardware. 5268 5269 store *none* *none* - local 1. ds_store 5270 **Unordered Atomic** 5271 ------------------------------------------------------------------------------------ 5272 load atomic unordered *any* *any* *Same as non-atomic*. 5273 store atomic unordered *any* *any* *Same as non-atomic*. 
5274 atomicrmw unordered *any* *any* *Same as monotonic atomic*. 5275 **Monotonic Atomic** 5276 ------------------------------------------------------------------------------------ 5277 load atomic monotonic - singlethread - global 1. buffer/global/ds/flat_load 5278 - wavefront - local 5279 - workgroup - generic 5280 load atomic monotonic - agent - global 1. buffer/global/flat_load 5281 - system - generic glc=1 5282 store atomic monotonic - singlethread - global 1. buffer/global/flat_store 5283 - wavefront - generic 5284 - workgroup 5285 - agent 5286 - system 5287 store atomic monotonic - singlethread - local 1. ds_store 5288 - wavefront 5289 - workgroup 5290 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic 5291 - wavefront - generic 5292 - workgroup 5293 - agent 5294 - system 5295 atomicrmw monotonic - singlethread - local 1. ds_atomic 5296 - wavefront 5297 - workgroup 5298 **Acquire Atomic** 5299 ------------------------------------------------------------------------------------ 5300 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load 5301 - wavefront - local 5302 - generic 5303 load atomic acquire - workgroup - global 1. buffer/global_load 5304 load atomic acquire - workgroup - local 1. ds/flat_load 5305 - generic 2. s_waitcnt lgkmcnt(0) 5306 5307 - If OpenCL, omit. 5308 - Must happen before 5309 any following 5310 global/generic 5311 load/load 5312 atomic/store/store 5313 atomic/atomicrmw. 5314 - Ensures any 5315 following global 5316 data read is no 5317 older than a local load 5318 atomic value being 5319 acquired. 5320 5321 load atomic acquire - agent - global 1. buffer/global_load 5322 - system glc=1 5323 2. s_waitcnt vmcnt(0) 5324 5325 - Must happen before 5326 following 5327 buffer_wbinvl1_vol. 5328 - Ensures the load 5329 has completed 5330 before invalidating 5331 the cache. 5332 5333 3. 
buffer_wbinvl1_vol 5334 5335 - Must happen before 5336 any following 5337 global/generic 5338 load/load 5339 atomic/atomicrmw. 5340 - Ensures that 5341 following 5342 loads will not see 5343 stale global data. 5344 5345 load atomic acquire - agent - generic 1. flat_load glc=1 5346 - system 2. s_waitcnt vmcnt(0) & 5347 lgkmcnt(0) 5348 5349 - If OpenCL omit 5350 lgkmcnt(0). 5351 - Must happen before 5352 following 5353 buffer_wbinvl1_vol. 5354 - Ensures the flat_load 5355 has completed 5356 before invalidating 5357 the cache. 5358 5359 3. buffer_wbinvl1_vol 5360 5361 - Must happen before 5362 any following 5363 global/generic 5364 load/load 5365 atomic/atomicrmw. 5366 - Ensures that 5367 following loads 5368 will not see stale 5369 global data. 5370 5371 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic 5372 - wavefront - local 5373 - generic 5374 atomicrmw acquire - workgroup - global 1. buffer/global_atomic 5375 atomicrmw acquire - workgroup - local 1. ds/flat_atomic 5376 - generic 2. s_waitcnt lgkmcnt(0) 5377 5378 - If OpenCL, omit. 5379 - Must happen before 5380 any following 5381 global/generic 5382 load/load 5383 atomic/store/store 5384 atomic/atomicrmw. 5385 - Ensures any 5386 following global 5387 data read is no 5388 older than a local 5389 atomicrmw value 5390 being acquired. 5391 5392 atomicrmw acquire - agent - global 1. buffer/global_atomic 5393 - system 2. s_waitcnt vmcnt(0) 5394 5395 - Must happen before 5396 following 5397 buffer_wbinvl1_vol. 5398 - Ensures the 5399 atomicrmw has 5400 completed before 5401 invalidating the 5402 cache. 5403 5404 3. buffer_wbinvl1_vol 5405 5406 - Must happen before 5407 any following 5408 global/generic 5409 load/load 5410 atomic/atomicrmw. 5411 - Ensures that 5412 following loads 5413 will not see stale 5414 global data. 5415 5416 atomicrmw acquire - agent - generic 1. flat_atomic 5417 - system 2. s_waitcnt vmcnt(0) & 5418 lgkmcnt(0) 5419 5420 - If OpenCL, omit 5421 lgkmcnt(0). 
5422 - Must happen before 5423 following 5424 buffer_wbinvl1_vol. 5425 - Ensures the 5426 atomicrmw has 5427 completed before 5428 invalidating the 5429 cache. 5430 5431 3. buffer_wbinvl1_vol 5432 5433 - Must happen before 5434 any following 5435 global/generic 5436 load/load 5437 atomic/atomicrmw. 5438 - Ensures that 5439 following loads 5440 will not see stale 5441 global data. 5442 5443 fence acquire - singlethread *none* *none* 5444 - wavefront 5445 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0) 5446 5447 - If OpenCL and 5448 address space is 5449 not generic, omit. 5450 - However, since LLVM 5451 currently has no 5452 address space on 5453 the fence need to 5454 conservatively 5455 always generate. If 5456 fence had an 5457 address space then 5458 set to address 5459 space of OpenCL 5460 fence flag, or to 5461 generic if both 5462 local and global 5463 flags are 5464 specified. 5465 - Must happen after 5466 any preceding 5467 local/generic load 5468 atomic/atomicrmw 5469 with an equal or 5470 wider sync scope 5471 and memory ordering 5472 stronger than 5473 unordered (this is 5474 termed the 5475 fence-paired-atomic). 5476 - Must happen before 5477 any following 5478 global/generic 5479 load/load 5480 atomic/store/store 5481 atomic/atomicrmw. 5482 - Ensures any 5483 following global 5484 data read is no 5485 older than the 5486 value read by the 5487 fence-paired-atomic. 5488 5489 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) & 5490 - system vmcnt(0) 5491 5492 - If OpenCL and 5493 address space is 5494 not generic, omit 5495 lgkmcnt(0). 5496 - However, since LLVM 5497 currently has no 5498 address space on 5499 the fence need to 5500 conservatively 5501 always generate 5502 (see comment for 5503 previous fence). 5504 - Could be split into 5505 separate s_waitcnt 5506 vmcnt(0) and 5507 s_waitcnt 5508 lgkmcnt(0) to allow 5509 them to be 5510 independently moved 5511 according to the 5512 following rules. 
5513 - s_waitcnt vmcnt(0) 5514 must happen after 5515 any preceding 5516 global/generic load 5517 atomic/atomicrmw 5518 with an equal or 5519 wider sync scope 5520 and memory ordering 5521 stronger than 5522 unordered (this is 5523 termed the 5524 fence-paired-atomic). 5525 - s_waitcnt lgkmcnt(0) 5526 must happen after 5527 any preceding 5528 local/generic load 5529 atomic/atomicrmw 5530 with an equal or 5531 wider sync scope 5532 and memory ordering 5533 stronger than 5534 unordered (this is 5535 termed the 5536 fence-paired-atomic). 5537 - Must happen before 5538 the following 5539 buffer_wbinvl1_vol. 5540 - Ensures that the 5541 fence-paired atomic 5542 has completed 5543 before invalidating 5544 the 5545 cache. Therefore 5546 any following 5547 locations read must 5548 be no older than 5549 the value read by 5550 the 5551 fence-paired-atomic. 5552 5553 2. buffer_wbinvl1_vol 5554 5555 - Must happen before any 5556 following global/generic 5557 load/load 5558 atomic/store/store 5559 atomic/atomicrmw. 5560 - Ensures that 5561 following loads 5562 will not see stale 5563 global data. 5564 5565 **Release Atomic** 5566 ------------------------------------------------------------------------------------ 5567 store atomic release - singlethread - global 1. buffer/global/ds/flat_store 5568 - wavefront - local 5569 - generic 5570 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0) 5571 - generic 5572 - If OpenCL, omit. 5573 - Must happen after 5574 any preceding 5575 local/generic 5576 load/store/load 5577 atomic/store 5578 atomic/atomicrmw. 5579 - Must happen before 5580 the following 5581 store. 5582 - Ensures that all 5583 memory operations 5584 to local have 5585 completed before 5586 performing the 5587 store that is being 5588 released. 5589 5590 2. buffer/global/flat_store 5591 store atomic release - workgroup - local 1. ds_store 5592 store atomic release - agent - global 1. 
s_waitcnt lgkmcnt(0) & 5593 - system - generic vmcnt(0) 5594 5595 - If OpenCL and 5596 address space is 5597 not generic, omit 5598 lgkmcnt(0). 5599 - Could be split into 5600 separate s_waitcnt 5601 vmcnt(0) and 5602 s_waitcnt 5603 lgkmcnt(0) to allow 5604 them to be 5605 independently moved 5606 according to the 5607 following rules. 5608 - s_waitcnt vmcnt(0) 5609 must happen after 5610 any preceding 5611 global/generic 5612 load/store/load 5613 atomic/store 5614 atomic/atomicrmw. 5615 - s_waitcnt lgkmcnt(0) 5616 must happen after 5617 any preceding 5618 local/generic 5619 load/store/load 5620 atomic/store 5621 atomic/atomicrmw. 5622 - Must happen before 5623 the following 5624 store. 5625 - Ensures that all 5626 memory operations 5627 to memory have 5628 completed before 5629 performing the 5630 store that is being 5631 released. 5632 5633 2. buffer/global/flat_store 5634 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic 5635 - wavefront - local 5636 - generic 5637 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0) 5638 - generic 5639 - If OpenCL, omit. 5640 - Must happen after 5641 any preceding 5642 local/generic 5643 load/store/load 5644 atomic/store 5645 atomic/atomicrmw. 5646 - Must happen before 5647 the following 5648 atomicrmw. 5649 - Ensures that all 5650 memory operations 5651 to local have 5652 completed before 5653 performing the 5654 atomicrmw that is 5655 being released. 5656 5657 2. buffer/global/flat_atomic 5658 atomicrmw release - workgroup - local 1. ds_atomic 5659 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) & 5660 - system - generic vmcnt(0) 5661 5662 - If OpenCL, omit 5663 lgkmcnt(0). 5664 - Could be split into 5665 separate s_waitcnt 5666 vmcnt(0) and 5667 s_waitcnt 5668 lgkmcnt(0) to allow 5669 them to be 5670 independently moved 5671 according to the 5672 following rules. 
5673 - s_waitcnt vmcnt(0) 5674 must happen after 5675 any preceding 5676 global/generic 5677 load/store/load 5678 atomic/store 5679 atomic/atomicrmw. 5680 - s_waitcnt lgkmcnt(0) 5681 must happen after 5682 any preceding 5683 local/generic 5684 load/store/load 5685 atomic/store 5686 atomic/atomicrmw. 5687 - Must happen before 5688 the following 5689 atomicrmw. 5690 - Ensures that all 5691 memory operations 5692 to global and local 5693 have completed 5694 before performing 5695 the atomicrmw that 5696 is being released. 5697 5698 2. buffer/global/flat_atomic 5699 fence release - singlethread *none* *none* 5700 - wavefront 5701 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0) 5702 5703 - If OpenCL and 5704 address space is 5705 not generic, omit. 5706 - However, since LLVM 5707 currently has no 5708 address space on 5709 the fence need to 5710 conservatively 5711 always generate. If 5712 fence had an 5713 address space then 5714 set to address 5715 space of OpenCL 5716 fence flag, or to 5717 generic if both 5718 local and global 5719 flags are 5720 specified. 5721 - Must happen after 5722 any preceding 5723 local/generic 5724 load/load 5725 atomic/store/store 5726 atomic/atomicrmw. 5727 - Must happen before 5728 any following store 5729 atomic/atomicrmw 5730 with an equal or 5731 wider sync scope 5732 and memory ordering 5733 stronger than 5734 unordered (this is 5735 termed the 5736 fence-paired-atomic). 5737 - Ensures that all 5738 memory operations 5739 to local have 5740 completed before 5741 performing the 5742 following 5743 fence-paired-atomic. 5744 5745 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) & 5746 - system vmcnt(0) 5747 5748 - If OpenCL and 5749 address space is 5750 not generic, omit 5751 lgkmcnt(0). 5752 - If OpenCL and 5753 address space is 5754 local, omit 5755 vmcnt(0). 5756 - However, since LLVM 5757 currently has no 5758 address space on 5759 the fence need to 5760 conservatively 5761 always generate. 
If 5762 fence had an 5763 address space then 5764 set to address 5765 space of OpenCL 5766 fence flag, or to 5767 generic if both 5768 local and global 5769 flags are 5770 specified. 5771 - Could be split into 5772 separate s_waitcnt 5773 vmcnt(0) and 5774 s_waitcnt 5775 lgkmcnt(0) to allow 5776 them to be 5777 independently moved 5778 according to the 5779 following rules. 5780 - s_waitcnt vmcnt(0) 5781 must happen after 5782 any preceding 5783 global/generic 5784 load/store/load 5785 atomic/store 5786 atomic/atomicrmw. 5787 - s_waitcnt lgkmcnt(0) 5788 must happen after 5789 any preceding 5790 local/generic 5791 load/store/load 5792 atomic/store 5793 atomic/atomicrmw. 5794 - Must happen before 5795 any following store 5796 atomic/atomicrmw 5797 with an equal or 5798 wider sync scope 5799 and memory ordering 5800 stronger than 5801 unordered (this is 5802 termed the 5803 fence-paired-atomic). 5804 - Ensures that all 5805 memory operations 5806 have 5807 completed before 5808 performing the 5809 following 5810 fence-paired-atomic. 5811 5812 **Acquire-Release Atomic** 5813 ------------------------------------------------------------------------------------ 5814 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic 5815 - wavefront - local 5816 - generic 5817 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0) 5818 5819 - If OpenCL, omit. 5820 - Must happen after 5821 any preceding 5822 local/generic 5823 load/store/load 5824 atomic/store 5825 atomic/atomicrmw. 5826 - Must happen before 5827 the following 5828 atomicrmw. 5829 - Ensures that all 5830 memory operations 5831 to local have 5832 completed before 5833 performing the 5834 atomicrmw that is 5835 being released. 5836 5837 2. buffer/global_atomic 5838 5839 atomicrmw acq_rel - workgroup - local 1. ds_atomic 5840 2. s_waitcnt lgkmcnt(0) 5841 5842 - If OpenCL, omit. 
5843 - Must happen before 5844 any following 5845 global/generic 5846 load/load 5847 atomic/store/store 5848 atomic/atomicrmw. 5849 - Ensures any 5850 following global 5851 data read is no 5852 older than the local load 5853 atomic value being 5854 acquired. 5855 5856 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0) 5857 5858 - If OpenCL, omit. 5859 - Must happen after 5860 any preceding 5861 local/generic 5862 load/store/load 5863 atomic/store 5864 atomic/atomicrmw. 5865 - Must happen before 5866 the following 5867 atomicrmw. 5868 - Ensures that all 5869 memory operations 5870 to local have 5871 completed before 5872 performing the 5873 atomicrmw that is 5874 being released. 5875 5876 2. flat_atomic 5877 3. s_waitcnt lgkmcnt(0) 5878 5879 - If OpenCL, omit. 5880 - Must happen before 5881 any following 5882 global/generic 5883 load/load 5884 atomic/store/store 5885 atomic/atomicrmw. 5886 - Ensures any 5887 following global 5888 data read is no 5889 older than a local load 5890 atomic value being 5891 acquired. 5892 5893 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) & 5894 - system vmcnt(0) 5895 5896 - If OpenCL, omit 5897 lgkmcnt(0). 5898 - Could be split into 5899 separate s_waitcnt 5900 vmcnt(0) and 5901 s_waitcnt 5902 lgkmcnt(0) to allow 5903 them to be 5904 independently moved 5905 according to the 5906 following rules. 5907 - s_waitcnt vmcnt(0) 5908 must happen after 5909 any preceding 5910 global/generic 5911 load/store/load 5912 atomic/store 5913 atomic/atomicrmw. 5914 - s_waitcnt lgkmcnt(0) 5915 must happen after 5916 any preceding 5917 local/generic 5918 load/store/load 5919 atomic/store 5920 atomic/atomicrmw. 5921 - Must happen before 5922 the following 5923 atomicrmw. 5924 - Ensures that all 5925 memory operations 5926 to global have 5927 completed before 5928 performing the 5929 atomicrmw that is 5930 being released. 5931 5932 2. buffer/global_atomic 5933 3. 
s_waitcnt vmcnt(0) 5934 5935 - Must happen before 5936 following 5937 buffer_wbinvl1_vol. 5938 - Ensures the 5939 atomicrmw has 5940 completed before 5941 invalidating the 5942 cache. 5943 5944 4. buffer_wbinvl1_vol 5945 5946 - Must happen before 5947 any following 5948 global/generic 5949 load/load 5950 atomic/atomicrmw. 5951 - Ensures that 5952 following loads 5953 will not see stale 5954 global data. 5955 5956 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) & 5957 - system vmcnt(0) 5958 5959 - If OpenCL, omit 5960 lgkmcnt(0). 5961 - Could be split into 5962 separate s_waitcnt 5963 vmcnt(0) and 5964 s_waitcnt 5965 lgkmcnt(0) to allow 5966 them to be 5967 independently moved 5968 according to the 5969 following rules. 5970 - s_waitcnt vmcnt(0) 5971 must happen after 5972 any preceding 5973 global/generic 5974 load/store/load 5975 atomic/store 5976 atomic/atomicrmw. 5977 - s_waitcnt lgkmcnt(0) 5978 must happen after 5979 any preceding 5980 local/generic 5981 load/store/load 5982 atomic/store 5983 atomic/atomicrmw. 5984 - Must happen before 5985 the following 5986 atomicrmw. 5987 - Ensures that all 5988 memory operations 5989 to global have 5990 completed before 5991 performing the 5992 atomicrmw that is 5993 being released. 5994 5995 2. flat_atomic 5996 3. s_waitcnt vmcnt(0) & 5997 lgkmcnt(0) 5998 5999 - If OpenCL, omit 6000 lgkmcnt(0). 6001 - Must happen before 6002 following 6003 buffer_wbinvl1_vol. 6004 - Ensures the 6005 atomicrmw has 6006 completed before 6007 invalidating the 6008 cache. 6009 6010 4. buffer_wbinvl1_vol 6011 6012 - Must happen before 6013 any following 6014 global/generic 6015 load/load 6016 atomic/atomicrmw. 6017 - Ensures that 6018 following loads 6019 will not see stale 6020 global data. 6021 6022 fence acq_rel - singlethread *none* *none* 6023 - wavefront 6024 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0) 6025 6026 - If OpenCL and 6027 address space is 6028 not generic, omit. 
6029 - However, 6030 since LLVM 6031 currently has no 6032 address space on 6033 the fence need to 6034 conservatively 6035 always generate 6036 (see comment for 6037 previous fence). 6038 - Must happen after 6039 any preceding 6040 local/generic 6041 load/load 6042 atomic/store/store 6043 atomic/atomicrmw. 6044 - Must happen before 6045 any following 6046 global/generic 6047 load/load 6048 atomic/store/store 6049 atomic/atomicrmw. 6050 - Ensures that all 6051 memory operations 6052 to local have 6053 completed before 6054 performing any 6055 following global 6056 memory operations. 6057 - Ensures that the 6058 preceding 6059 local/generic load 6060 atomic/atomicrmw 6061 with an equal or 6062 wider sync scope 6063 and memory ordering 6064 stronger than 6065 unordered (this is 6066 termed the 6067 acquire-fence-paired-atomic) 6068 has completed 6069 before following 6070 global memory 6071 operations. This 6072 satisfies the 6073 requirements of 6074 acquire. 6075 - Ensures that all 6076 previous memory 6077 operations have 6078 completed before a 6079 following 6080 local/generic store 6081 atomic/atomicrmw 6082 with an equal or 6083 wider sync scope 6084 and memory ordering 6085 stronger than 6086 unordered (this is 6087 termed the 6088 release-fence-paired-atomic). 6089 This satisfies the 6090 requirements of 6091 release. 6092 6093 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) & 6094 - system vmcnt(0) 6095 6096 - If OpenCL and 6097 address space is 6098 not generic, omit 6099 lgkmcnt(0). 6100 - However, since LLVM 6101 currently has no 6102 address space on 6103 the fence need to 6104 conservatively 6105 always generate 6106 (see comment for 6107 previous fence). 6108 - Could be split into 6109 separate s_waitcnt 6110 vmcnt(0) and 6111 s_waitcnt 6112 lgkmcnt(0) to allow 6113 them to be 6114 independently moved 6115 according to the 6116 following rules. 
6117 - s_waitcnt vmcnt(0) 6118 must happen after 6119 any preceding 6120 global/generic 6121 load/store/load 6122 atomic/store 6123 atomic/atomicrmw. 6124 - s_waitcnt lgkmcnt(0) 6125 must happen after 6126 any preceding 6127 local/generic 6128 load/store/load 6129 atomic/store 6130 atomic/atomicrmw. 6131 - Must happen before 6132 the following 6133 buffer_wbinvl1_vol. 6134 - Ensures that the 6135 preceding 6136 global/local/generic 6137 load 6138 atomic/atomicrmw 6139 with an equal or 6140 wider sync scope 6141 and memory ordering 6142 stronger than 6143 unordered (this is 6144 termed the 6145 acquire-fence-paired-atomic) 6146 has completed 6147 before invalidating 6148 the cache. This 6149 satisfies the 6150 requirements of 6151 acquire. 6152 - Ensures that all 6153 previous memory 6154 operations have 6155 completed before a 6156 following 6157 global/local/generic 6158 store 6159 atomic/atomicrmw 6160 with an equal or 6161 wider sync scope 6162 and memory ordering 6163 stronger than 6164 unordered (this is 6165 termed the 6166 release-fence-paired-atomic). 6167 This satisfies the 6168 requirements of 6169 release. 6170 6171 2. buffer_wbinvl1_vol 6172 6173 - Must happen before 6174 any following 6175 global/generic 6176 load/load 6177 atomic/store/store 6178 atomic/atomicrmw. 6179 - Ensures that 6180 following loads 6181 will not see stale 6182 global data. This 6183 satisfies the 6184 requirements of 6185 acquire. 6186 6187 **Sequential Consistent Atomic** 6188 ------------------------------------------------------------------------------------ 6189 load atomic seq_cst - singlethread - global *Same as corresponding 6190 - wavefront - local load atomic acquire, 6191 - generic except must generate 6192 all instructions even 6193 for OpenCL.* 6194 load atomic seq_cst - workgroup - global 1. 
                                                            s_waitcnt lgkmcnt(0)
                                              - generic
                                                          - Must happen after
                                                            preceding local/generic
                                                            load atomic/store
                                                            atomic/atomicrmw with
                                                            memory ordering of
                                                            seq_cst and with equal
                                                            or wider sync scope.
                                                            (Note that seq_cst
                                                            fences have their own
                                                            s_waitcnt lgkmcnt(0)
                                                            and so do not need to
                                                            be considered.)
                                                          - Ensures any preceding
                                                            sequential consistent
                                                            local memory
                                                            instructions have
                                                            completed before
                                                            executing this
                                                            sequentially consistent
                                                            instruction. This
                                                            prevents reordering a
                                                            seq_cst store followed
                                                            by a seq_cst load.
                                                            (Note that seq_cst is
                                                            stronger than
                                                            acquire/release as the
                                                            reordering of load
                                                            acquire followed by a
                                                            store release is
                                                            prevented by the
                                                            s_waitcnt of the
                                                            release, but there is
                                                            nothing preventing a
                                                            store release followed
                                                            by load acquire from
                                                            completing out of
                                                            order. The s_waitcnt
                                                            could be placed after
                                                            seq_store or before the
                                                            seq_load. We choose the
                                                            load to make the
                                                            s_waitcnt be as late as
                                                            possible so that the
                                                            store may have already
                                                            completed.)

                                                         2. *Following instructions
                                                            same as corresponding
                                                            load atomic acquire,
                                                            except must generate
                                                            all instructions even
                                                            for OpenCL.*
     load atomic  seq_cst      - workgroup    - local    *Same as corresponding
                                                         load atomic acquire,
                                                         except must generate all
                                                         instructions even for
                                                         OpenCL.*
     load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
                               - system       - generic     vmcnt(0)

                                                          - Could be split into
                                                            separate s_waitcnt
                                                            vmcnt(0) and s_waitcnt
                                                            lgkmcnt(0) to allow
                                                            them to be
                                                            independently moved
                                                            according to the
                                                            following rules.
                                                          - s_waitcnt lgkmcnt(0)
                                                            must happen after
                                                            preceding
                                                            global/generic load
                                                            atomic/store
                                                            atomic/atomicrmw with
                                                            memory ordering of
                                                            seq_cst and with equal
                                                            or wider sync scope.
                                                            (Note that seq_cst
                                                            fences have their own
                                                            s_waitcnt lgkmcnt(0)
                                                            and so do not need to
                                                            be considered.)
                                                          - s_waitcnt vmcnt(0)
                                                            must happen after
                                                            preceding
                                                            global/generic load
                                                            atomic/store
                                                            atomic/atomicrmw with
                                                            memory ordering of
                                                            seq_cst and with equal
                                                            or wider sync scope.
                                                            (Note that seq_cst
                                                            fences have their own
                                                            s_waitcnt vmcnt(0) and
                                                            so do not need to be
                                                            considered.)
                                                          - Ensures any preceding
                                                            sequential consistent
                                                            global memory
                                                            instructions have
                                                            completed before
                                                            executing this
                                                            sequentially consistent
                                                            instruction. This
                                                            prevents reordering a
                                                            seq_cst store followed
                                                            by a seq_cst load.
                                                            (Note that seq_cst is
                                                            stronger than
                                                            acquire/release as the
                                                            reordering of load
                                                            acquire followed by a
                                                            store release is
                                                            prevented by the
                                                            s_waitcnt of the
                                                            release, but there is
                                                            nothing preventing a
                                                            store release followed
                                                            by load acquire from
                                                            completing out of
                                                            order. The s_waitcnt
                                                            could be placed after
                                                            seq_store or before the
                                                            seq_load. We choose the
                                                            load to make the
                                                            s_waitcnt be as late as
                                                            possible so that the
                                                            store may have already
                                                            completed.)

                                                         2. *Following instructions
                                                            same as corresponding
                                                            load atomic acquire,
                                                            except must generate
                                                            all instructions even
                                                            for OpenCL.*
     store atomic seq_cst      - singlethread - global   *Same as corresponding
                               - wavefront    - local    store atomic release,
                               - workgroup    - generic  except must generate all
                               - agent                   instructions even for
                               - system                  OpenCL.*
     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
                               - wavefront    - local    atomicrmw acq_rel, except
                               - workgroup    - generic  must generate all
                               - agent                   instructions even for
                               - system                  OpenCL.*
     fence        seq_cst      - singlethread *none*     *Same as corresponding
                               - wavefront               fence acq_rel, except
                               - workgroup               must generate all
                               - agent                   instructions even for
                               - system                  OpenCL.*
     ============ ============ ============== ========== ================================

.. _amdgpu-amdhsa-memory-model-gfx90a:

Memory Model GFX90A
+++++++++++++++++++

For GFX90A:

* Each agent has multiple shader arrays (SA).
* Each SA has multiple compute units (CU).
* Each CU has multiple SIMDs that execute wavefronts.
* The wavefronts for a single work-group are executed in the same CU but may be
  executed by different SIMDs. The exception is tgsplit execution mode, in
  which the wavefronts may be executed by different SIMDs in different CUs.
* Each CU has a single LDS memory shared by the wavefronts of the work-groups
  executing on it. The exception is tgsplit execution mode, in which no LDS is
  allocated as wavefronts of the same work-group can be in different CUs.
* All LDS operations of a CU are performed as wavefront wide operations in a
  global order and involve no caching. Completion is reported to a wavefront in
  execution order.
* The LDS memory has multiple request queues shared by the SIMDs of a CU.
  Therefore, the LDS operations performed by different wavefronts of a
  work-group can be reordered relative to each other, which can result in
  reordering the visibility of vector memory operations with respect to LDS
  operations of other wavefronts in the same work-group. A ``s_waitcnt
  lgkmcnt(0)`` is required to ensure synchronization between LDS operations
  and vector memory operations between wavefronts of a work-group, but not
  between operations performed by the same wavefront.
* The vector memory operations are performed as wavefront wide operations and
  completion is reported to a wavefront in execution order. The exception is
  that ``flat_load/store/atomic`` instructions can report out of vector memory
  order if they access LDS memory, and out of LDS operation order if they
  access global memory.
* The vector memory operations access a single vector L1 cache shared by all
  SIMDs of a CU. Therefore:

  * No special action is required for coherence between the lanes of a single
    wavefront.
  * No special action is required for coherence between wavefronts in the same
    work-group since they execute on the same CU. The exception is when in
    tgsplit execution mode as wavefronts of the same work-group can be in
    different CUs and so a ``buffer_wbinvl1_vol`` is required as described in
    the following item.
  * A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
    executing in different work-groups as they may be executing on different
    CUs.

* The scalar memory operations access a scalar L1 cache shared by all
  wavefronts on a group of CUs. The scalar and vector L1 caches are not
  coherent. However, scalar operations are used in a restricted way so do not
  impact the memory model. See :ref:`amdgpu-amdhsa-memory-spaces`.
* The vector and scalar memory operations use an L2 cache shared by all CUs on
  the same agent.

  * The L2 cache has independent channels to service disjoint ranges of
    virtual addresses.
  * Each CU has a separate request queue per channel. Therefore, the vector
    and scalar memory operations performed by wavefronts executing in
    different work-groups (which may be executing on different CUs), or the
    same work-group if executing in tgsplit mode, of an agent can be reordered
    relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
    synchronization between vector memory operations of different CUs. It
    ensures a previous vector memory operation has completed before executing
    a subsequent vector memory or LDS operation and so can be used to meet the
    requirements of acquire and release.
  * The L2 cache of one agent can be kept coherent with other agents by:
    using the MTYPE RW (read-write) or MTYPE CC (cache-coherent) with the PTE
    C-bit for memory local to the L2; and using the MTYPE NC (non-coherent)
    with the PTE C-bit set or MTYPE UC (uncached) for memory not local to the
    L2.

    * Any local memory cache lines will be automatically invalidated by
      writes from CUs associated with other L2 caches, or writes from the
      CPU, due to the cache probe caused by coherent requests. Coherent
      requests are caused by GPU accesses to pages with the PTE C-bit set, by
      CPU accesses over XGMI, and by PCIe requests that are configured to be
      coherent requests.
    * XGMI accesses from the CPU to local memory may be cached on the CPU.
      Subsequent access from the GPU will automatically invalidate or
      writeback the CPU cache due to the L2 probe filter and the PTE C-bit
      being set.
    * Since all work-groups on the same agent share the same L2, no L2
      invalidation or writeback is required for coherence.
    * To ensure coherence of local and remote memory writes of work-groups in
      different agents a ``buffer_wbl2`` is required. It will writeback dirty
      L2 cache lines of MTYPE RW (used for local coarse grain memory) and
      MTYPE NC (used for remote coarse grain memory). Note that MTYPE CC
      (used for local fine grain memory) causes write through to DRAM, and
      MTYPE UC (used for remote fine grain memory) bypasses the L2, so
      neither will ever result in dirty L2 cache lines.
    * To ensure coherence of local and remote memory reads of work-groups in
      different agents a ``buffer_invl2`` is required. It will invalidate L2
      cache lines with MTYPE NC (used for remote coarse grain memory). Note
      that MTYPE CC (used for local fine grain memory) and MTYPE RW (used for
      local coarse grain memory) cause local reads to be invalidated by
      remote writes with the PTE C-bit set, so these cache lines are not
      invalidated. Note that MTYPE UC (used for remote fine grain memory)
      bypasses the L2, so will never result in L2 cache lines that need to be
      invalidated.

  * PCIe access from the GPU to the CPU memory is kept coherent by using the
    MTYPE UC (uncached) which bypasses the L2.

Scalar memory operations are only used to access memory that is proven to not
change during the execution of the kernel dispatch. This includes constant
address space and global address space for program scope ``const`` variables.
Therefore, the kernel machine code does not have to maintain the scalar cache
to ensure it is coherent with the vector caches. The scalar and vector caches
are invalidated between kernel dispatches by CP since constant address space
data may change between kernel dispatch executions. See
:ref:`amdgpu-amdhsa-memory-spaces`.

The one exception is if scalar writes are used to spill SGPR registers. In
this case the AMDGPU backend ensures the memory location used to spill is
never accessed by vector memory operations at the same time.
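The per-scope code sequences tabulated below all implement ordinary LLVM/C++ release and acquire semantics; a host-side ``std::atomic`` analogue may help make the mapping concrete. The following sketch is illustrative only (the variable names ``payload`` and ``flag`` are hypothetical); the comments note the GFX90A machine code the table prescribes for the equivalent agent-scope global-memory atomics:

.. code-block:: cpp

    // Host-side C++ analogue of a release/acquire handshake. On GFX90A the
    // compiler lowers the equivalent agent-scope global atomics to the
    // sequences shown in the comments (per the table below).
    #include <atomic>
    #include <cassert>
    #include <thread>

    int payload = 0;           // plain (non-atomic) global data
    std::atomic<int> flag{0};  // synchronization flag

    void producer() {
      payload = 42;
      // store atomic release (agent scope, global address space) lowers to:
      //   s_waitcnt lgkmcnt(0) & vmcnt(0)  ; drain all preceding accesses
      //   buffer/global/flat_store         ; the released store
      flag.store(1, std::memory_order_release);
    }

    void consumer() {
      // load atomic acquire (agent scope, global address space) lowers to:
      //   buffer/global_load glc=1         ; load, bypassing stale L1 line
      //   s_waitcnt vmcnt(0)               ; wait for the load to complete
      //   buffer_wbinvl1_vol               ; invalidate volatile L1 lines
      while (flag.load(std::memory_order_acquire) == 0) {
      }
      assert(payload == 42);  // release/acquire makes the write visible
    }

    int run_demo() {
      std::thread t1(producer), t2(consumer);
      t1.join();
      t2.join();
      return payload;
    }

    int main() { return run_demo() == 42 ? 0 : 1; }

At system scope the release side additionally needs the ``buffer_wbl2`` writeback and the acquire side the ``buffer_invl2`` invalidate described above, since the L2 is only coherent per agent for MTYPE NC memory.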
If scalar writes are used then a ``s_dcache_wb`` is inserted before the
``s_endpgm`` and before a function return since the locations may be used for
vector memory instructions by a future wavefront that uses the same scratch
area, or a function call that creates a frame at the same address,
respectively. There is no need for a ``s_dcache_inv`` as all scalar writes are
write-before-read in the same thread.

For kernarg backing memory:

* CP invalidates the L1 cache at the start of each kernel dispatch.
* On dGPU over XGMI or PCIe the kernarg backing memory is allocated in host
  memory accessed as MTYPE UC (uncached) to avoid needing to invalidate the
  L2 cache. This also causes it to be treated as non-volatile and so is not
  invalidated by ``*_vol``.
* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent)
  and so the L2 cache will be coherent with the CPU and other agents.

Scratch backing memory (which is used for the private address space) is
accessed with MTYPE NC_NV (non-coherent non-volatile). Since the private
address space is only accessed by a single thread, and is always
write-before-read, there is never a need to invalidate these entries from the
L1 cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate
the volatile cache lines.

The code sequences used to implement the memory model for GFX90A are defined
in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.

  .. table:: AMDHSA Memory Model Code Sequences GFX90A
     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table

     ============ ============ ============== ========== ================================
     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
                  Ordering     Sync Scope     Address    GFX90A
                                              Space
     ============ ============ ============== ========== ================================
     **Non-Atomic**
     ------------------------------------------------------------------------------------
     load         *none*       *none*         - global   - !volatile & !nontemporal
                                              - generic
                                              - private  1. buffer/global/flat_load
                                              - constant
                                                         - !volatile & nontemporal

                                                         1. buffer/global/flat_load
                                                            glc=1 slc=1

                                                         - volatile

                                                         1. buffer/global/flat_load
                                                            glc=1
                                                         2. s_waitcnt vmcnt(0)

                                                          - Must happen before any
                                                            following volatile
                                                            global/generic
                                                            load/store.
                                                          - Ensures that volatile
                                                            operations to different
                                                            addresses will not be
                                                            reordered by hardware.

     load         *none*       *none*         - local    1. ds_load
     store        *none*       *none*         - global   - !volatile & !nontemporal
                                              - generic
                                              - private  1. buffer/global/flat_store
                                              - constant
                                                         - !volatile & nontemporal

                                                         1. buffer/global/flat_store
                                                            glc=1 slc=1

                                                         - volatile

                                                         1. buffer/global/flat_store
                                                         2. s_waitcnt vmcnt(0)

                                                          - Must happen before any
                                                            following volatile
                                                            global/generic
                                                            load/store.
                                                          - Ensures that volatile
                                                            operations to different
                                                            addresses will not be
                                                            reordered by hardware.

     store        *none*       *none*         - local    1. ds_store
     **Unordered Atomic**
     ------------------------------------------------------------------------------------
     load atomic  unordered    *any*          *any*      *Same as non-atomic*.
     store atomic unordered    *any*          *any*      *Same as non-atomic*.
6580 atomicrmw unordered *any* *any* *Same as monotonic atomic*. 6581 **Monotonic Atomic** 6582 ------------------------------------------------------------------------------------ 6583 load atomic monotonic - singlethread - global 1. buffer/global/flat_load 6584 - wavefront - generic 6585 load atomic monotonic - workgroup - global 1. buffer/global/flat_load 6586 - generic glc=1 6587 6588 - If not TgSplit execution 6589 mode, omit glc=1. 6590 6591 load atomic monotonic - singlethread - local *If TgSplit execution mode, 6592 - wavefront local address space cannot 6593 - workgroup be used.* 6594 6595 1. ds_load 6596 load atomic monotonic - agent - global 1. buffer/global/flat_load 6597 - generic glc=1 6598 load atomic monotonic - system - global 1. buffer/global/flat_load 6599 - generic glc=1 6600 store atomic monotonic - singlethread - global 1. buffer/global/flat_store 6601 - wavefront - generic 6602 - workgroup 6603 - agent 6604 store atomic monotonic - system - global 1. buffer/global/flat_store 6605 - generic 6606 store atomic monotonic - singlethread - local *If TgSplit execution mode, 6607 - wavefront local address space cannot 6608 - workgroup be used.* 6609 6610 1. ds_store 6611 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic 6612 - wavefront - generic 6613 - workgroup 6614 - agent 6615 atomicrmw monotonic - system - global 1. buffer/global/flat_atomic 6616 - generic 6617 atomicrmw monotonic - singlethread - local *If TgSplit execution mode, 6618 - wavefront local address space cannot 6619 - workgroup be used.* 6620 6621 1. ds_atomic 6622 **Acquire Atomic** 6623 ------------------------------------------------------------------------------------ 6624 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load 6625 - wavefront - local 6626 - generic 6627 load atomic acquire - workgroup - global 1. buffer/global_load glc=1 6628 6629 - If not TgSplit execution 6630 mode, omit glc=1. 6631 6632 2. 
s_waitcnt vmcnt(0) 6633 6634 - If not TgSplit execution 6635 mode, omit. 6636 - Must happen before the 6637 following buffer_wbinvl1_vol. 6638 6639 3. buffer_wbinvl1_vol 6640 6641 - If not TgSplit execution 6642 mode, omit. 6643 - Must happen before 6644 any following 6645 global/generic 6646 load/load 6647 atomic/store/store 6648 atomic/atomicrmw. 6649 - Ensures that 6650 following 6651 loads will not see 6652 stale data. 6653 6654 load atomic acquire - workgroup - local *If TgSplit execution mode, 6655 local address space cannot 6656 be used.* 6657 6658 1. ds_load 6659 2. s_waitcnt lgkmcnt(0) 6660 6661 - If OpenCL, omit. 6662 - Must happen before 6663 any following 6664 global/generic 6665 load/load 6666 atomic/store/store 6667 atomic/atomicrmw. 6668 - Ensures any 6669 following global 6670 data read is no 6671 older than the local load 6672 atomic value being 6673 acquired. 6674 6675 load atomic acquire - workgroup - generic 1. flat_load glc=1 6676 6677 - If not TgSplit execution 6678 mode, omit glc=1. 6679 6680 2. s_waitcnt lgkm/vmcnt(0) 6681 6682 - Use lgkmcnt(0) if not 6683 TgSplit execution mode 6684 and vmcnt(0) if TgSplit 6685 execution mode. 6686 - If OpenCL, omit lgkmcnt(0). 6687 - Must happen before 6688 the following 6689 buffer_wbinvl1_vol and any 6690 following global/generic 6691 load/load 6692 atomic/store/store 6693 atomic/atomicrmw. 6694 - Ensures any 6695 following global 6696 data read is no 6697 older than a local load 6698 atomic value being 6699 acquired. 6700 6701 3. buffer_wbinvl1_vol 6702 6703 - If not TgSplit execution 6704 mode, omit. 6705 - Ensures that 6706 following 6707 loads will not see 6708 stale data. 6709 6710 load atomic acquire - agent - global 1. buffer/global_load 6711 glc=1 6712 2. s_waitcnt vmcnt(0) 6713 6714 - Must happen before 6715 following 6716 buffer_wbinvl1_vol. 6717 - Ensures the load 6718 has completed 6719 before invalidating 6720 the cache. 6721 6722 3. 
buffer_wbinvl1_vol 6723 6724 - Must happen before 6725 any following 6726 global/generic 6727 load/load 6728 atomic/atomicrmw. 6729 - Ensures that 6730 following 6731 loads will not see 6732 stale global data. 6733 6734 load atomic acquire - system - global 1. buffer/global/flat_load 6735 glc=1 6736 2. s_waitcnt vmcnt(0) 6737 6738 - Must happen before 6739 following buffer_invl2 and 6740 buffer_wbinvl1_vol. 6741 - Ensures the load 6742 has completed 6743 before invalidating 6744 the cache. 6745 6746 3. buffer_invl2; 6747 buffer_wbinvl1_vol 6748 6749 - Must happen before 6750 any following 6751 global/generic 6752 load/load 6753 atomic/atomicrmw. 6754 - Ensures that 6755 following 6756 loads will not see 6757 stale L1 global data, 6758 nor see stale L2 MTYPE 6759 NC global data. 6760 MTYPE RW and CC memory will 6761 never be stale in L2 due to 6762 the memory probes. 6763 6764 load atomic acquire - agent - generic 1. flat_load glc=1 6765 2. s_waitcnt vmcnt(0) & 6766 lgkmcnt(0) 6767 6768 - If TgSplit execution mode, 6769 omit lgkmcnt(0). 6770 - If OpenCL omit 6771 lgkmcnt(0). 6772 - Must happen before 6773 following 6774 buffer_wbinvl1_vol. 6775 - Ensures the flat_load 6776 has completed 6777 before invalidating 6778 the cache. 6779 6780 3. buffer_wbinvl1_vol 6781 6782 - Must happen before 6783 any following 6784 global/generic 6785 load/load 6786 atomic/atomicrmw. 6787 - Ensures that 6788 following loads 6789 will not see stale 6790 global data. 6791 6792 load atomic acquire - system - generic 1. flat_load glc=1 6793 2. s_waitcnt vmcnt(0) & 6794 lgkmcnt(0) 6795 6796 - If TgSplit execution mode, 6797 omit lgkmcnt(0). 6798 - If OpenCL omit 6799 lgkmcnt(0). 6800 - Must happen before 6801 following 6802 buffer_invl2 and 6803 buffer_wbinvl1_vol. 6804 - Ensures the flat_load 6805 has completed 6806 before invalidating 6807 the caches. 6808 6809 3. 
buffer_invl2; 6810 buffer_wbinvl1_vol 6811 6812 - Must happen before 6813 any following 6814 global/generic 6815 load/load 6816 atomic/atomicrmw. 6817 - Ensures that 6818 following 6819 loads will not see 6820 stale L1 global data, 6821 nor see stale L2 MTYPE 6822 NC global data. 6823 MTYPE RW and CC memory will 6824 never be stale in L2 due to 6825 the memory probes. 6826 6827 atomicrmw acquire - singlethread - global 1. buffer/global/flat_atomic 6828 - wavefront - generic 6829 atomicrmw acquire - singlethread - local *If TgSplit execution mode, 6830 - wavefront local address space cannot 6831 be used.* 6832 6833 1. ds_atomic 6834 atomicrmw acquire - workgroup - global 1. buffer/global_atomic 6835 2. s_waitcnt vmcnt(0) 6836 6837 - If not TgSplit execution 6838 mode, omit. 6839 - Must happen before the 6840 following buffer_wbinvl1_vol. 6841 - Ensures the atomicrmw 6842 has completed 6843 before invalidating 6844 the cache. 6845 6846 3. buffer_wbinvl1_vol 6847 6848 - If not TgSplit execution 6849 mode, omit. 6850 - Must happen before 6851 any following 6852 global/generic 6853 load/load 6854 atomic/atomicrmw. 6855 - Ensures that 6856 following loads 6857 will not see stale 6858 global data. 6859 6860 atomicrmw acquire - workgroup - local *If TgSplit execution mode, 6861 local address space cannot 6862 be used.* 6863 6864 1. ds_atomic 6865 2. s_waitcnt lgkmcnt(0) 6866 6867 - If OpenCL, omit. 6868 - Must happen before 6869 any following 6870 global/generic 6871 load/load 6872 atomic/store/store 6873 atomic/atomicrmw. 6874 - Ensures any 6875 following global 6876 data read is no 6877 older than the local 6878 atomicrmw value 6879 being acquired. 6880 6881 atomicrmw acquire - workgroup - generic 1. flat_atomic 6882 2. s_waitcnt lgkm/vmcnt(0) 6883 6884 - Use lgkmcnt(0) if not 6885 TgSplit execution mode 6886 and vmcnt(0) if TgSplit 6887 execution mode. 6888 - If OpenCL, omit lgkmcnt(0). 
6889 - Must happen before 6890 the following 6891 buffer_wbinvl1_vol and 6892 any following 6893 global/generic 6894 load/load 6895 atomic/store/store 6896 atomic/atomicrmw. 6897 - Ensures any 6898 following global 6899 data read is no 6900 older than a local 6901 atomicrmw value 6902 being acquired. 6903 6904 3. buffer_wbinvl1_vol 6905 6906 - If not TgSplit execution 6907 mode, omit. 6908 - Ensures that 6909 following 6910 loads will not see 6911 stale data. 6912 6913 atomicrmw acquire - agent - global 1. buffer/global_atomic 6914 2. s_waitcnt vmcnt(0) 6915 6916 - Must happen before 6917 following 6918 buffer_wbinvl1_vol. 6919 - Ensures the 6920 atomicrmw has 6921 completed before 6922 invalidating the 6923 cache. 6924 6925 3. buffer_wbinvl1_vol 6926 6927 - Must happen before 6928 any following 6929 global/generic 6930 load/load 6931 atomic/atomicrmw. 6932 - Ensures that 6933 following loads 6934 will not see stale 6935 global data. 6936 6937 atomicrmw acquire - system - global 1. buffer/global_atomic 6938 2. s_waitcnt vmcnt(0) 6939 6940 - Must happen before 6941 following buffer_invl2 and 6942 buffer_wbinvl1_vol. 6943 - Ensures the 6944 atomicrmw has 6945 completed before 6946 invalidating the 6947 caches. 6948 6949 3. buffer_invl2; 6950 buffer_wbinvl1_vol 6951 6952 - Must happen before 6953 any following 6954 global/generic 6955 load/load 6956 atomic/atomicrmw. 6957 - Ensures that 6958 following 6959 loads will not see 6960 stale L1 global data, 6961 nor see stale L2 MTYPE 6962 NC global data. 6963 MTYPE RW and CC memory will 6964 never be stale in L2 due to 6965 the memory probes. 6966 6967 atomicrmw acquire - agent - generic 1. flat_atomic 6968 2. s_waitcnt vmcnt(0) & 6969 lgkmcnt(0) 6970 6971 - If TgSplit execution mode, 6972 omit lgkmcnt(0). 6973 - If OpenCL, omit 6974 lgkmcnt(0). 6975 - Must happen before 6976 following 6977 buffer_wbinvl1_vol. 6978 - Ensures the 6979 atomicrmw has 6980 completed before 6981 invalidating the 6982 cache. 6983 6984 3. 
buffer_wbinvl1_vol 6985 6986 - Must happen before 6987 any following 6988 global/generic 6989 load/load 6990 atomic/atomicrmw. 6991 - Ensures that 6992 following loads 6993 will not see stale 6994 global data. 6995 6996 atomicrmw acquire - system - generic 1. flat_atomic 6997 2. s_waitcnt vmcnt(0) & 6998 lgkmcnt(0) 6999 7000 - If TgSplit execution mode, 7001 omit lgkmcnt(0). 7002 - If OpenCL, omit 7003 lgkmcnt(0). 7004 - Must happen before 7005 following 7006 buffer_invl2 and 7007 buffer_wbinvl1_vol. 7008 - Ensures the 7009 atomicrmw has 7010 completed before 7011 invalidating the 7012 caches. 7013 7014 3. buffer_invl2; 7015 buffer_wbinvl1_vol 7016 7017 - Must happen before 7018 any following 7019 global/generic 7020 load/load 7021 atomic/atomicrmw. 7022 - Ensures that 7023 following 7024 loads will not see 7025 stale L1 global data, 7026 nor see stale L2 MTYPE 7027 NC global data. 7028 MTYPE RW and CC memory will 7029 never be stale in L2 due to 7030 the memory probes. 7031 7032 fence acquire - singlethread *none* *none* 7033 - wavefront 7034 fence acquire - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0) 7035 7036 - Use lgkmcnt(0) if not 7037 TgSplit execution mode 7038 and vmcnt(0) if TgSplit 7039 execution mode. 7040 - If OpenCL and 7041 address space is 7042 not generic, omit 7043 lgkmcnt(0). 7044 - If OpenCL and 7045 address space is 7046 local, omit 7047 vmcnt(0). 7048 - However, since LLVM 7049 currently has no 7050 address space on 7051 the fence need to 7052 conservatively 7053 always generate. If 7054 fence had an 7055 address space then 7056 set to address 7057 space of OpenCL 7058 fence flag, or to 7059 generic if both 7060 local and global 7061 flags are 7062 specified. 7063 - s_waitcnt vmcnt(0) 7064 must happen after 7065 any preceding 7066 global/generic load 7067 atomic/ 7068 atomicrmw 7069 with an equal or 7070 wider sync scope 7071 and memory ordering 7072 stronger than 7073 unordered (this is 7074 termed the 7075 fence-paired-atomic). 
7076 - s_waitcnt lgkmcnt(0) 7077 must happen after 7078 any preceding 7079 local/generic load 7080 atomic/atomicrmw 7081 with an equal or 7082 wider sync scope 7083 and memory ordering 7084 stronger than 7085 unordered (this is 7086 termed the 7087 fence-paired-atomic). 7088 - Must happen before 7089 the following 7090 buffer_wbinvl1_vol and 7091 any following 7092 global/generic 7093 load/load 7094 atomic/store/store 7095 atomic/atomicrmw. 7096 - Ensures any 7097 following global 7098 data read is no 7099 older than the 7100 value read by the 7101 fence-paired-atomic. 7102 7103 2. buffer_wbinvl1_vol 7104 7105 - If not TgSplit execution 7106 mode, omit. 7107 - Ensures that 7108 following 7109 loads will not see 7110 stale data. 7111 7112 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) & 7113 vmcnt(0) 7114 7115 - If TgSplit execution mode, 7116 omit lgkmcnt(0). 7117 - If OpenCL and 7118 address space is 7119 not generic, omit 7120 lgkmcnt(0). 7121 - However, since LLVM 7122 currently has no 7123 address space on 7124 the fence need to 7125 conservatively 7126 always generate 7127 (see comment for 7128 previous fence). 7129 - Could be split into 7130 separate s_waitcnt 7131 vmcnt(0) and 7132 s_waitcnt 7133 lgkmcnt(0) to allow 7134 them to be 7135 independently moved 7136 according to the 7137 following rules. 7138 - s_waitcnt vmcnt(0) 7139 must happen after 7140 any preceding 7141 global/generic load 7142 atomic/atomicrmw 7143 with an equal or 7144 wider sync scope 7145 and memory ordering 7146 stronger than 7147 unordered (this is 7148 termed the 7149 fence-paired-atomic). 7150 - s_waitcnt lgkmcnt(0) 7151 must happen after 7152 any preceding 7153 local/generic load 7154 atomic/atomicrmw 7155 with an equal or 7156 wider sync scope 7157 and memory ordering 7158 stronger than 7159 unordered (this is 7160 termed the 7161 fence-paired-atomic). 7162 - Must happen before 7163 the following 7164 buffer_wbinvl1_vol. 
7165 - Ensures that the 7166 fence-paired atomic 7167 has completed 7168 before invalidating 7169 the 7170 cache. Therefore 7171 any following 7172 locations read must 7173 be no older than 7174 the value read by 7175 the 7176 fence-paired-atomic. 7177 7178 2. buffer_wbinvl1_vol 7179 7180 - Must happen before any 7181 following global/generic 7182 load/load 7183 atomic/store/store 7184 atomic/atomicrmw. 7185 - Ensures that 7186 following loads 7187 will not see stale 7188 global data. 7189 7190 fence acquire - system *none* 1. s_waitcnt lgkmcnt(0) & 7191 vmcnt(0) 7192 7193 - If TgSplit execution mode, 7194 omit lgkmcnt(0). 7195 - If OpenCL and 7196 address space is 7197 not generic, omit 7198 lgkmcnt(0). 7199 - However, since LLVM 7200 currently has no 7201 address space on 7202 the fence need to 7203 conservatively 7204 always generate 7205 (see comment for 7206 previous fence). 7207 - Could be split into 7208 separate s_waitcnt 7209 vmcnt(0) and 7210 s_waitcnt 7211 lgkmcnt(0) to allow 7212 them to be 7213 independently moved 7214 according to the 7215 following rules. 7216 - s_waitcnt vmcnt(0) 7217 must happen after 7218 any preceding 7219 global/generic load 7220 atomic/atomicrmw 7221 with an equal or 7222 wider sync scope 7223 and memory ordering 7224 stronger than 7225 unordered (this is 7226 termed the 7227 fence-paired-atomic). 7228 - s_waitcnt lgkmcnt(0) 7229 must happen after 7230 any preceding 7231 local/generic load 7232 atomic/atomicrmw 7233 with an equal or 7234 wider sync scope 7235 and memory ordering 7236 stronger than 7237 unordered (this is 7238 termed the 7239 fence-paired-atomic). 7240 - Must happen before 7241 the following buffer_invl2 and 7242 buffer_wbinvl1_vol. 7243 - Ensures that the 7244 fence-paired atomic 7245 has completed 7246 before invalidating 7247 the 7248 cache. Therefore 7249 any following 7250 locations read must 7251 be no older than 7252 the value read by 7253 the 7254 fence-paired-atomic. 7255 7256 2. 
buffer_invl2; 7257 buffer_wbinvl1_vol 7258 7259 - Must happen before any 7260 following global/generic 7261 load/load 7262 atomic/store/store 7263 atomic/atomicrmw. 7264 - Ensures that 7265 following 7266 loads will not see 7267 stale L1 global data, 7268 nor see stale L2 MTYPE 7269 NC global data. 7270 MTYPE RW and CC memory will 7271 never be stale in L2 due to 7272 the memory probes. 7273 **Release Atomic** 7274 ------------------------------------------------------------------------------------ 7275 store atomic release - singlethread - global 1. buffer/global/flat_store 7276 - wavefront - generic 7277 store atomic release - singlethread - local *If TgSplit execution mode, 7278 - wavefront local address space cannot 7279 be used.* 7280 7281 1. ds_store 7282 store atomic release - workgroup - global 1. s_waitcnt lgkm/vmcnt(0) 7283 - generic 7284 - Use lgkmcnt(0) if not 7285 TgSplit execution mode 7286 and vmcnt(0) if TgSplit 7287 execution mode. 7288 - If OpenCL, omit lgkmcnt(0). 7289 - s_waitcnt vmcnt(0) 7290 must happen after 7291 any preceding 7292 global/generic load/store/ 7293 load atomic/store atomic/ 7294 atomicrmw. 7295 - s_waitcnt lgkmcnt(0) 7296 must happen after 7297 any preceding 7298 local/generic 7299 load/store/load 7300 atomic/store 7301 atomic/atomicrmw. 7302 - Must happen before 7303 the following 7304 store. 7305 - Ensures that all 7306 memory operations 7307 have 7308 completed before 7309 performing the 7310 store that is being 7311 released. 7312 7313 2. buffer/global/flat_store 7314 store atomic release - workgroup - local *If TgSplit execution mode, 7315 local address space cannot 7316 be used.* 7317 7318 1. ds_store 7319 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) & 7320 - generic vmcnt(0) 7321 7322 - If TgSplit execution mode, 7323 omit lgkmcnt(0). 7324 - If OpenCL and 7325 address space is 7326 not generic, omit 7327 lgkmcnt(0). 
7328 - Could be split into 7329 separate s_waitcnt 7330 vmcnt(0) and 7331 s_waitcnt 7332 lgkmcnt(0) to allow 7333 them to be 7334 independently moved 7335 according to the 7336 following rules. 7337 - s_waitcnt vmcnt(0) 7338 must happen after 7339 any preceding 7340 global/generic 7341 load/store/load 7342 atomic/store 7343 atomic/atomicrmw. 7344 - s_waitcnt lgkmcnt(0) 7345 must happen after 7346 any preceding 7347 local/generic 7348 load/store/load 7349 atomic/store 7350 atomic/atomicrmw. 7351 - Must happen before 7352 the following 7353 store. 7354 - Ensures that all 7355 memory operations 7356 to memory have 7357 completed before 7358 performing the 7359 store that is being 7360 released. 7361 7362 2. buffer/global/flat_store 7363 store atomic release - system - global 1. buffer_wbl2 7364 - generic 7365 - Must happen before 7366 following s_waitcnt. 7367 - Performs L2 writeback to 7368 ensure previous 7369 global/generic 7370 store/atomicrmw are 7371 visible at system scope. 7372 7373 2. s_waitcnt lgkmcnt(0) & 7374 vmcnt(0) 7375 7376 - If TgSplit execution mode, 7377 omit lgkmcnt(0). 7378 - If OpenCL and 7379 address space is 7380 not generic, omit 7381 lgkmcnt(0). 7382 - Could be split into 7383 separate s_waitcnt 7384 vmcnt(0) and 7385 s_waitcnt 7386 lgkmcnt(0) to allow 7387 them to be 7388 independently moved 7389 according to the 7390 following rules. 7391 - s_waitcnt vmcnt(0) 7392 must happen after any 7393 preceding 7394 global/generic 7395 load/store/load 7396 atomic/store 7397 atomic/atomicrmw. 7398 - s_waitcnt lgkmcnt(0) 7399 must happen after any 7400 preceding 7401 local/generic 7402 load/store/load 7403 atomic/store 7404 atomic/atomicrmw. 7405 - Must happen before 7406 the following 7407 store. 7408 - Ensures that all 7409 memory operations 7410 to memory and the L2 7411 writeback have 7412 completed before 7413 performing the 7414 store that is being 7415 released. 7416 7417 3. 
buffer/global/flat_store 7418 atomicrmw release - singlethread - global 1. buffer/global/flat_atomic 7419 - wavefront - generic 7420 atomicrmw release - singlethread - local *If TgSplit execution mode, 7421 - wavefront local address space cannot 7422 be used.* 7423 7424 1. ds_atomic 7425 atomicrmw release - workgroup - global 1. s_waitcnt lgkm/vmcnt(0) 7426 - generic 7427 - Use lgkmcnt(0) if not 7428 TgSplit execution mode 7429 and vmcnt(0) if TgSplit 7430 execution mode. 7431 - If OpenCL, omit 7432 lgkmcnt(0). 7433 - s_waitcnt vmcnt(0) 7434 must happen after 7435 any preceding 7436 global/generic load/store/ 7437 load atomic/store atomic/ 7438 atomicrmw. 7439 - s_waitcnt lgkmcnt(0) 7440 must happen after 7441 any preceding 7442 local/generic 7443 load/store/load 7444 atomic/store 7445 atomic/atomicrmw. 7446 - Must happen before 7447 the following 7448 atomicrmw. 7449 - Ensures that all 7450 memory operations 7451 have 7452 completed before 7453 performing the 7454 atomicrmw that is 7455 being released. 7456 7457 2. buffer/global/flat_atomic 7458 atomicrmw release - workgroup - local *If TgSplit execution mode, 7459 local address space cannot 7460 be used.* 7461 7462 1. ds_atomic 7463 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) & 7464 - generic vmcnt(0) 7465 7466 - If TgSplit execution mode, 7467 omit lgkmcnt(0). 7468 - If OpenCL, omit 7469 lgkmcnt(0). 7470 - Could be split into 7471 separate s_waitcnt 7472 vmcnt(0) and 7473 s_waitcnt 7474 lgkmcnt(0) to allow 7475 them to be 7476 independently moved 7477 according to the 7478 following rules. 7479 - s_waitcnt vmcnt(0) 7480 must happen after 7481 any preceding 7482 global/generic 7483 load/store/load 7484 atomic/store 7485 atomic/atomicrmw. 7486 - s_waitcnt lgkmcnt(0) 7487 must happen after 7488 any preceding 7489 local/generic 7490 load/store/load 7491 atomic/store 7492 atomic/atomicrmw. 7493 - Must happen before 7494 the following 7495 atomicrmw. 
                                                            - Ensures that all
                                                              memory operations
                                                              to global and local
                                                              have completed
                                                              before performing
                                                              the atomicrmw that
                                                              is being released.

                                                         2. buffer/global/flat_atomic
     atomicrmw    release      - system        - global  1. buffer_wbl2
                                              - generic
                                                            - Must happen before
                                                              following s_waitcnt.
                                                            - Performs L2 writeback to
                                                              ensure previous
                                                              global/generic
                                                              store/atomicrmw are
                                                              visible at system scope.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                            - If TgSplit execution mode,
                                                              omit lgkmcnt(0).
                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Could be split into
                                                              separate s_waitcnt
                                                              vmcnt(0) and
                                                              s_waitcnt
                                                              lgkmcnt(0) to allow
                                                              them to be
                                                              independently moved
                                                              according to the
                                                              following rules.
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              local/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before
                                                              the following
                                                              atomicrmw.
                                                            - Ensures that all
                                                              memory operations
                                                              to memory and the L2
                                                              writeback have
                                                              completed before
                                                              performing the
                                                              store that is being
                                                              released.

                                                         3. buffer/global/flat_atomic
     fence        release      - singlethread *none*     *none*
                               - wavefront
     fence        release      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)

                                                            - Use lgkmcnt(0) if not
                                                              TgSplit execution mode
                                                              and vmcnt(0) if TgSplit
                                                              execution mode.
                                                            - If OpenCL and
                                                              address space is
                                                              not generic, omit
                                                              lgkmcnt(0).
                                                            - If OpenCL and
                                                              address space is
                                                              local, omit
                                                              vmcnt(0).
                                                            - However, since LLVM
                                                              currently has no
                                                              address space on
                                                              the fence need to
                                                              conservatively
                                                              always generate. If
                                                              fence had an
                                                              address space then
                                                              set to address
                                                              space of OpenCL
                                                              fence flag, or to
                                                              generic if both
                                                              local and global
                                                              flags are
                                                              specified.
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic
                                                              load/store/
                                                              load atomic/store atomic/
                                                              atomicrmw.
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              local/generic
                                                              load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Must happen before
                                                              any following store
                                                              atomic/atomicrmw
                                                              with an equal or
                                                              wider sync scope
                                                              and memory ordering
                                                              stronger than
                                                              unordered (this is
                                                              termed the
                                                              fence-paired-atomic).
                                                            - Ensures that all
                                                              memory operations
                                                              have
                                                              completed before
                                                              performing the
                                                              following
                                                              fence-paired-atomic.

     fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                            - If TgSplit execution mode,
                                                              omit lgkmcnt(0).
                                                            - If OpenCL and
                                                              address space is
                                                              not generic, omit
                                                              lgkmcnt(0).
                                                            - If OpenCL and
                                                              address space is
                                                              local, omit
                                                              vmcnt(0).
                                                            - However, since LLVM
                                                              currently has no
                                                              address space on
                                                              the fence need to
                                                              conservatively
                                                              always generate. If
                                                              fence had an
                                                              address space then
                                                              set to address
                                                              space of OpenCL
                                                              fence flag, or to
                                                              generic if both
                                                              local and global
                                                              flags are
                                                              specified.
                                                            - Could be split into
                                                              separate s_waitcnt
                                                              vmcnt(0) and
                                                              s_waitcnt
                                                              lgkmcnt(0) to allow
                                                              them to be
                                                              independently moved
                                                              according to the
                                                              following rules.
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              local/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before
                                                              any following store
                                                              atomic/atomicrmw
                                                              with an equal or
                                                              wider sync scope
                                                              and memory ordering
                                                              stronger than
                                                              unordered (this is
                                                              termed the
                                                              fence-paired-atomic).
                                                            - Ensures that all
                                                              memory operations
                                                              have
                                                              completed before
                                                              performing the
                                                              following
                                                              fence-paired-atomic.

     fence        release      - system       *none*     1. buffer_wbl2

                                                            - If OpenCL and
                                                              address space is
                                                              local, omit.
                                                            - Must happen before
                                                              following s_waitcnt.
                                                            - Performs L2 writeback to
                                                              ensure previous
                                                              global/generic
                                                              store/atomicrmw are
                                                              visible at system scope.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                            - If TgSplit execution mode,
                                                              omit lgkmcnt(0).
                                                            - If OpenCL and
                                                              address space is
                                                              not generic, omit
                                                              lgkmcnt(0).
                                                            - If OpenCL and
                                                              address space is
                                                              local, omit
                                                              vmcnt(0).
                                                            - However, since LLVM
                                                              currently has no
                                                              address space on
                                                              the fence need to
                                                              conservatively
                                                              always generate. If
                                                              fence had an
                                                              address space then
                                                              set to address
                                                              space of OpenCL
                                                              fence flag, or to
                                                              generic if both
                                                              local and global
                                                              flags are
                                                              specified.
                                                            - Could be split into
                                                              separate s_waitcnt
                                                              vmcnt(0) and
                                                              s_waitcnt
                                                              lgkmcnt(0) to allow
                                                              them to be
                                                              independently moved
                                                              according to the
                                                              following rules.
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              local/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before
                                                              any following store
                                                              atomic/atomicrmw
                                                              with an equal or
                                                              wider sync scope
                                                              and memory ordering
                                                              stronger than
                                                              unordered (this is
                                                              termed the
                                                              fence-paired-atomic).
                                                            - Ensures that all
                                                              memory operations
                                                              have
                                                              completed before
                                                              performing the
                                                              following
                                                              fence-paired-atomic.

     **Acquire-Release Atomic**
     ------------------------------------------------------------------------------------
     atomicrmw    acq_rel      - singlethread - global   1. buffer/global/flat_atomic
                               - wavefront    - generic
     atomicrmw    acq_rel      - singlethread - local    *If TgSplit execution mode,
                               - wavefront               local address space cannot
                                                         be used.*

                                                         1. ds_atomic
     atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)

                                                            - Use lgkmcnt(0) if not
                                                              TgSplit execution mode
                                                              and vmcnt(0) if TgSplit
                                                              execution mode.
                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Must happen after
                                                              any preceding
                                                              local/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic load/store/
                                                              load atomic/store atomic/
                                                              atomicrmw.
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              local/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before
                                                              the following
                                                              atomicrmw.
                                                            - Ensures that all
                                                              memory operations
                                                              have
                                                              completed before
                                                              performing the
                                                              atomicrmw that is
                                                              being released.

                                                         2. buffer/global_atomic
                                                         3. s_waitcnt vmcnt(0)

                                                            - If not TgSplit execution
                                                              mode, omit.
                                                            - Must happen before
                                                              the following
                                                              buffer_wbinvl1_vol.
                                                            - Ensures any
                                                              following global
                                                              data read is no
                                                              older than the
                                                              atomicrmw value
                                                              being acquired.

                                                         4. buffer_wbinvl1_vol

                                                            - If not TgSplit execution
                                                              mode, omit.
                                                            - Ensures that
                                                              following
                                                              loads will not see
                                                              stale data.

     atomicrmw    acq_rel      - workgroup    - local    *If TgSplit execution mode,
                                                         local address space cannot
                                                         be used.*

                                                         1. ds_atomic
                                                         2. s_waitcnt lgkmcnt(0)

                                                            - If OpenCL, omit.
                                                            - Must happen before
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures any
                                                              following global
                                                              data read is no
                                                              older than the local load
                                                              atomic value being
                                                              acquired.

     atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkm/vmcnt(0)

                                                            - Use lgkmcnt(0) if not
                                                              TgSplit execution mode
                                                              and vmcnt(0) if TgSplit
                                                              execution mode.
                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic load/store/
                                                              load atomic/store atomic/
                                                              atomicrmw.
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              local/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before
                                                              the following
                                                              atomicrmw.
                                                            - Ensures that all
                                                              memory operations
                                                              have
                                                              completed before
                                                              performing the
                                                              atomicrmw that is
                                                              being released.

                                                         2. flat_atomic
                                                         3. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                            - If not TgSplit execution
                                                              mode, omit vmcnt(0).
                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Must happen before
                                                              the following
                                                              buffer_wbinvl1_vol and
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures any
                                                              following global
                                                              data read is no
                                                              older than a local load
                                                              atomic value being
                                                              acquired.

                                                         4. buffer_wbinvl1_vol

                                                            - If not TgSplit execution
                                                              mode, omit.
                                                            - Ensures that
                                                              following
                                                              loads will not see
                                                              stale data.

     atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                            - If TgSplit execution mode,
                                                              omit lgkmcnt(0).
                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Could be split into
                                                              separate s_waitcnt
                                                              vmcnt(0) and
                                                              s_waitcnt
                                                              lgkmcnt(0) to allow
                                                              them to be
                                                              independently moved
                                                              according to the
                                                              following rules.
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              local/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before
                                                              the following
                                                              atomicrmw.
                                                            - Ensures that all
                                                              memory operations
                                                              to global have
                                                              completed before
                                                              performing the
                                                              atomicrmw that is
                                                              being released.

                                                         2. buffer/global_atomic
                                                         3. s_waitcnt vmcnt(0)

                                                            - Must happen before
                                                              following
                                                              buffer_wbinvl1_vol.
                                                            - Ensures the
                                                              atomicrmw has
                                                              completed before
                                                              invalidating the
                                                              cache.

                                                         4. buffer_wbinvl1_vol

                                                            - Must happen before
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/atomicrmw.
                                                            - Ensures that
                                                              following loads
                                                              will not see stale
                                                              global data.

     atomicrmw    acq_rel      - system       - global   1. buffer_wbl2

                                                            - Must happen before
                                                              following s_waitcnt.
                                                            - Performs L2 writeback to
                                                              ensure previous
                                                              global/generic
                                                              store/atomicrmw are
                                                              visible at system scope.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                            - If TgSplit execution mode,
                                                              omit lgkmcnt(0).
                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Could be split into
                                                              separate s_waitcnt
                                                              vmcnt(0) and
                                                              s_waitcnt
                                                              lgkmcnt(0) to allow
                                                              them to be
                                                              independently moved
                                                              according to the
                                                              following rules.
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              local/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before
                                                              the following
                                                              atomicrmw.
                                                            - Ensures that all
                                                              memory operations
                                                              to global and L2 writeback
                                                              have completed before
                                                              performing the
                                                              atomicrmw that is
                                                              being released.

                                                         3. buffer/global_atomic
                                                         4. s_waitcnt vmcnt(0)

                                                            - Must happen before
                                                              following buffer_invl2 and
                                                              buffer_wbinvl1_vol.
                                                            - Ensures the
                                                              atomicrmw has
                                                              completed before
                                                              invalidating the
                                                              caches.

                                                         5. buffer_invl2;
                                                            buffer_wbinvl1_vol

                                                            - Must happen before
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/atomicrmw.
                                                            - Ensures that
                                                              following
                                                              loads will not see
                                                              stale L1 global data,
                                                              nor see stale L2 MTYPE
                                                              NC global data.
                                                              MTYPE RW and CC memory will
                                                              never be stale in L2 due to
                                                              the memory probes.

     atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                            - If TgSplit execution mode,
                                                              omit lgkmcnt(0).
                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Could be split into
                                                              separate s_waitcnt
                                                              vmcnt(0) and
                                                              s_waitcnt
                                                              lgkmcnt(0) to allow
                                                              them to be
                                                              independently moved
                                                              according to the
                                                              following rules.
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              local/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before
                                                              the following
                                                              atomicrmw.
                                                            - Ensures that all
                                                              memory operations
                                                              to global have
                                                              completed before
                                                              performing the
                                                              atomicrmw that is
                                                              being released.

                                                         2. flat_atomic
                                                         3. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)

                                                            - If TgSplit execution mode,
                                                              omit lgkmcnt(0).
                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Must happen before
                                                              following
                                                              buffer_wbinvl1_vol.
                                                            - Ensures the
                                                              atomicrmw has
                                                              completed before
                                                              invalidating the
                                                              cache.

                                                         4. buffer_wbinvl1_vol

                                                            - Must happen before
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/atomicrmw.
                                                            - Ensures that
                                                              following loads
                                                              will not see stale
                                                              global data.

     atomicrmw    acq_rel      - system       - generic  1. buffer_wbl2

                                                            - Must happen before
                                                              following s_waitcnt.
                                                            - Performs L2 writeback to
                                                              ensure previous
                                                              global/generic
                                                              store/atomicrmw are
                                                              visible at system scope.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                            - If TgSplit execution mode,
                                                              omit lgkmcnt(0).
                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Could be split into
                                                              separate s_waitcnt
                                                              vmcnt(0) and
                                                              s_waitcnt
                                                              lgkmcnt(0) to allow
                                                              them to be
                                                              independently moved
                                                              according to the
                                                              following rules.
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              local/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before
                                                              the following
                                                              atomicrmw.
                                                            - Ensures that all
                                                              memory operations
                                                              to global and L2 writeback
                                                              have completed before
                                                              performing the
                                                              atomicrmw that is
                                                              being released.

                                                         3. flat_atomic
                                                         4. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)

                                                            - If TgSplit execution mode,
                                                              omit lgkmcnt(0).
                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Must happen before
                                                              following buffer_invl2 and
                                                              buffer_wbinvl1_vol.
                                                            - Ensures the
                                                              atomicrmw has
                                                              completed before
                                                              invalidating the
                                                              caches.

                                                         5. buffer_invl2;
                                                            buffer_wbinvl1_vol

                                                            - Must happen before
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/atomicrmw.
                                                            - Ensures that
                                                              following
                                                              loads will not see
                                                              stale L1 global data,
                                                              nor see stale L2 MTYPE
                                                              NC global data.
                                                              MTYPE RW and CC memory will
                                                              never be stale in L2 due to
                                                              the memory probes.

     fence        acq_rel      - singlethread *none*     *none*
                               - wavefront
     fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)

                                                            - Use lgkmcnt(0) if not
                                                              TgSplit execution mode
                                                              and vmcnt(0) if TgSplit
                                                              execution mode.
                                                            - If OpenCL and
                                                              address space is
                                                              not generic, omit
                                                              lgkmcnt(0).
                                                            - If OpenCL and
                                                              address space is
                                                              local, omit
                                                              vmcnt(0).
                                                            - However,
                                                              since LLVM
                                                              currently has no
                                                              address space on
                                                              the fence need to
                                                              conservatively
                                                              always generate
                                                              (see comment for
                                                              previous fence).
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic
                                                              load/store/
                                                              load atomic/store atomic/
                                                              atomicrmw.
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              local/generic
                                                              load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Must happen before
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures that all
                                                              memory operations
                                                              have
                                                              completed before
                                                              performing any
                                                              following global
                                                              memory operations.
                                                            - Ensures that the
                                                              preceding
                                                              local/generic load
                                                              atomic/atomicrmw
                                                              with an equal or
                                                              wider sync scope
                                                              and memory ordering
                                                              stronger than
                                                              unordered (this is
                                                              termed the
                                                              acquire-fence-paired-atomic)
                                                              has completed
                                                              before following
                                                              global memory
                                                              operations. This
                                                              satisfies the
                                                              requirements of
                                                              acquire.
                                                            - Ensures that all
                                                              previous memory
                                                              operations have
                                                              completed before a
                                                              following
                                                              local/generic store
                                                              atomic/atomicrmw
                                                              with an equal or
                                                              wider sync scope
                                                              and memory ordering
                                                              stronger than
                                                              unordered (this is
                                                              termed the
                                                              release-fence-paired-atomic).
                                                              This satisfies the
                                                              requirements of
                                                              release.
                                                            - Must happen before
                                                              the following
                                                              buffer_wbinvl1_vol.
                                                            - Ensures that the
                                                              acquire-fence-paired
                                                              atomic has completed
                                                              before invalidating
                                                              the
                                                              cache. Therefore
                                                              any following
                                                              locations read must
                                                              be no older than
                                                              the value read by
                                                              the
                                                              acquire-fence-paired-atomic.

                                                         2. buffer_wbinvl1_vol

                                                            - If not TgSplit execution
                                                              mode, omit.
                                                            - Ensures that
                                                              following
                                                              loads will not see
                                                              stale data.

     fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                            - If TgSplit execution mode,
                                                              omit lgkmcnt(0).
                                                            - If OpenCL and
                                                              address space is
                                                              not generic, omit
                                                              lgkmcnt(0).
                                                            - However, since LLVM
                                                              currently has no
                                                              address space on
                                                              the fence need to
                                                              conservatively
                                                              always generate
                                                              (see comment for
                                                              previous fence).
                                                            - Could be split into
                                                              separate s_waitcnt
                                                              vmcnt(0) and
                                                              s_waitcnt
                                                              lgkmcnt(0) to allow
                                                              them to be
                                                              independently moved
                                                              according to the
                                                              following rules.
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              local/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before
                                                              the following
                                                              buffer_wbinvl1_vol.
                                                            - Ensures that the
                                                              preceding
                                                              global/local/generic
                                                              load
                                                              atomic/atomicrmw
                                                              with an equal or
                                                              wider sync scope
                                                              and memory ordering
                                                              stronger than
                                                              unordered (this is
                                                              termed the
                                                              acquire-fence-paired-atomic)
                                                              has completed
                                                              before invalidating
                                                              the cache. This
                                                              satisfies the
                                                              requirements of
                                                              acquire.
                                                            - Ensures that all
                                                              previous memory
                                                              operations have
                                                              completed before a
                                                              following
                                                              global/local/generic
                                                              store
                                                              atomic/atomicrmw
                                                              with an equal or
                                                              wider sync scope
                                                              and memory ordering
                                                              stronger than
                                                              unordered (this is
                                                              termed the
                                                              release-fence-paired-atomic).
                                                              This satisfies the
                                                              requirements of
                                                              release.

                                                         2. buffer_wbinvl1_vol

                                                            - Must happen before
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures that
                                                              following loads
                                                              will not see stale
                                                              global data. This
                                                              satisfies the
                                                              requirements of
                                                              acquire.

     fence        acq_rel      - system       *none*     1. buffer_wbl2

                                                            - If OpenCL and
                                                              address space is
                                                              local, omit.
                                                            - Must happen before
                                                              following s_waitcnt.
                                                            - Performs L2 writeback to
                                                              ensure previous
                                                              global/generic
                                                              store/atomicrmw are
                                                              visible at system scope.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                            - If TgSplit execution mode,
                                                              omit lgkmcnt(0).
                                                            - If OpenCL and
                                                              address space is
                                                              not generic, omit
                                                              lgkmcnt(0).
                                                            - However, since LLVM
                                                              currently has no
                                                              address space on
                                                              the fence need to
                                                              conservatively
                                                              always generate
                                                              (see comment for
                                                              previous fence).
                                                            - Could be split into
                                                              separate s_waitcnt
                                                              vmcnt(0) and
                                                              s_waitcnt
                                                              lgkmcnt(0) to allow
                                                              them to be
                                                              independently moved
                                                              according to the
                                                              following rules.
8477 - s_waitcnt vmcnt(0) 8478 must happen after 8479 any preceding 8480 global/generic 8481 load/store/load 8482 atomic/store 8483 atomic/atomicrmw. 8484 - s_waitcnt lgkmcnt(0) 8485 must happen after 8486 any preceding 8487 local/generic 8488 load/store/load 8489 atomic/store 8490 atomic/atomicrmw. 8491 - Must happen before 8492 the following buffer_invl2 and 8493 buffer_wbinvl1_vol. 8494 - Ensures that the 8495 preceding 8496 global/local/generic 8497 load 8498 atomic/atomicrmw 8499 with an equal or 8500 wider sync scope 8501 and memory ordering 8502 stronger than 8503 unordered (this is 8504 termed the 8505 acquire-fence-paired-atomic) 8506 has completed 8507 before invalidating 8508 the cache. This 8509 satisfies the 8510 requirements of 8511 acquire. 8512 - Ensures that all 8513 previous memory 8514 operations have 8515 completed before a 8516 following 8517 global/local/generic 8518 store 8519 atomic/atomicrmw 8520 with an equal or 8521 wider sync scope 8522 and memory ordering 8523 stronger than 8524 unordered (this is 8525 termed the 8526 release-fence-paired-atomic). 8527 This satisfies the 8528 requirements of 8529 release. 8530 8531 3. buffer_invl2; 8532 buffer_wbinvl1_vol 8533 8534 - Must happen before 8535 any following 8536 global/generic 8537 load/load 8538 atomic/store/store 8539 atomic/atomicrmw. 8540 - Ensures that 8541 following 8542 loads will not see 8543 stale L1 global data, 8544 nor see stale L2 MTYPE 8545 NC global data. 8546 MTYPE RW and CC memory will 8547 never be stale in L2 due to 8548 the memory probes. 8549 8550 **Sequential Consistent Atomic** 8551 ------------------------------------------------------------------------------------ 8552 load atomic seq_cst - singlethread - global *Same as corresponding 8553 - wavefront - local load atomic acquire, 8554 - generic except must generate 8555 all instructions even 8556 for OpenCL.* 8557 load atomic seq_cst - workgroup - global 1. 
s_waitcnt lgkm/vmcnt(0) 8558 - generic 8559 - Use lgkmcnt(0) if not 8560 TgSplit execution mode 8561 and vmcnt(0) if TgSplit 8562 execution mode. 8563 - s_waitcnt lgkmcnt(0) must 8564 happen after 8565 preceding 8566 local/generic load 8567 atomic/store 8568 atomic/atomicrmw 8569 with memory 8570 ordering of seq_cst 8571 and with equal or 8572 wider sync scope. 8573 (Note that seq_cst 8574 fences have their 8575 own s_waitcnt 8576 lgkmcnt(0) and so do 8577 not need to be 8578 considered.) 8579 - s_waitcnt vmcnt(0) 8580 must happen after 8581 preceding 8582 global/generic load 8583 atomic/store 8584 atomic/atomicrmw 8585 with memory 8586 ordering of seq_cst 8587 and with equal or 8588 wider sync scope. 8589 (Note that seq_cst 8590 fences have their 8591 own s_waitcnt 8592 vmcnt(0) and so do 8593 not need to be 8594 considered.) 8595 - Ensures any 8596 preceding 8597 sequential 8598 consistent global/local 8599 memory instructions 8600 have completed 8601 before executing 8602 this sequentially 8603 consistent 8604 instruction. This 8605 prevents reordering 8606 a seq_cst store 8607 followed by a 8608 seq_cst load. (Note 8609 that seq_cst is 8610 stronger than 8611 acquire/release as 8612 the reordering of 8613 load acquire 8614 followed by a store 8615 release is 8616 prevented by the 8617 s_waitcnt of 8618 the release, but 8619 there is nothing 8620 preventing a store 8621 release followed by 8622 load acquire from 8623 completing out of 8624 order. The s_waitcnt 8625 could be placed after 8626 seq_store or before 8627 the seq_load. We 8628 choose the load to 8629 make the s_waitcnt be 8630 as late as possible 8631 so that the store 8632 may have already 8633 completed.) 8634 8635 2. 
*Following 8636 instructions same as 8637 corresponding load 8638 atomic acquire, 8639 except must generate 8640 all instructions even 8641 for OpenCL.* 8642 load atomic seq_cst - workgroup - local *If TgSplit execution mode, 8643 local address space cannot 8644 be used.* 8645 8646 *Same as corresponding 8647 load atomic acquire, 8648 except must generate 8649 all instructions even 8650 for OpenCL.* 8651 8652 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) & 8653 - system - generic vmcnt(0) 8654 8655 - If TgSplit execution mode, 8656 omit lgkmcnt(0). 8657 - Could be split into 8658 separate s_waitcnt 8659 vmcnt(0) 8660 and s_waitcnt 8661 lgkmcnt(0) to allow 8662 them to be 8663 independently moved 8664 according to the 8665 following rules. 8666 - s_waitcnt lgkmcnt(0) 8667 must happen after 8668 preceding 8669 global/generic load 8670 atomic/store 8671 atomic/atomicrmw 8672 with memory 8673 ordering of seq_cst 8674 and with equal or 8675 wider sync scope. 8676 (Note that seq_cst 8677 fences have their 8678 own s_waitcnt 8679 lgkmcnt(0) and so do 8680 not need to be 8681 considered.) 8682 - s_waitcnt vmcnt(0) 8683 must happen after 8684 preceding 8685 global/generic load 8686 atomic/store 8687 atomic/atomicrmw 8688 with memory 8689 ordering of seq_cst 8690 and with equal or 8691 wider sync scope. 8692 (Note that seq_cst 8693 fences have their 8694 own s_waitcnt 8695 vmcnt(0) and so do 8696 not need to be 8697 considered.) 8698 - Ensures any 8699 preceding 8700 sequential 8701 consistent global 8702 memory instructions 8703 have completed 8704 before executing 8705 this sequentially 8706 consistent 8707 instruction. This 8708 prevents reordering 8709 a seq_cst store 8710 followed by a 8711 seq_cst load. 
(Note 8712 that seq_cst is 8713 stronger than 8714 acquire/release as 8715 the reordering of 8716 load acquire 8717 followed by a store 8718 release is 8719 prevented by the 8720 s_waitcnt of 8721 the release, but 8722 there is nothing 8723 preventing a store 8724 release followed by 8725 load acquire from 8726 completing out of 8727 order. The s_waitcnt 8728 could be placed after 8729 seq_store or before 8730 the seq_load. We 8731 choose the load to 8732 make the s_waitcnt be 8733 as late as possible 8734 so that the store 8735 may have already 8736 completed.) 8737 8738 2. *Following 8739 instructions same as 8740 corresponding load 8741 atomic acquire, 8742 except must generate 8743 all instructions even 8744 for OpenCL.* 8745 store atomic seq_cst - singlethread - global *Same as corresponding 8746 - wavefront - local store atomic release, 8747 - workgroup - generic except must generate 8748 - agent all instructions even 8749 - system for OpenCL.* 8750 atomicrmw seq_cst - singlethread - global *Same as corresponding 8751 - wavefront - local atomicrmw acq_rel, 8752 - workgroup - generic except must generate 8753 - agent all instructions even 8754 - system for OpenCL.* 8755 fence seq_cst - singlethread *none* *Same as corresponding 8756 - wavefront fence acq_rel, 8757 - workgroup except must generate 8758 - agent all instructions even 8759 - system for OpenCL.* 8760 ============ ============ ============== ========== ================================ 8761 8762.. _amdgpu-amdhsa-memory-model-gfx940: 8763 8764Memory Model GFX940 8765+++++++++++++++++++ 8766 8767For GFX940: 8768 8769* Each agent has multiple shader arrays (SA). 8770* Each SA has multiple compute units (CU). 8771* Each CU has multiple SIMDs that execute wavefronts. 8772* The wavefronts for a single work-group are executed in the same CU but may be 8773 executed by different SIMDs. 
  The exception is when in tgsplit execution mode, when the wavefronts may be
  executed by different SIMDs in different CUs.
* Each CU has a single LDS memory shared by the wavefronts of the work-groups
  executing on it. The exception is when in tgsplit execution mode, when no LDS
  is allocated, as wavefronts of the same work-group can be in different CUs.
* All LDS operations of a CU are performed as wavefront wide operations in a
  global order and involve no caching. Completion is reported to a wavefront in
  execution order.
* The LDS memory has multiple request queues shared by the SIMDs of a
  CU. Therefore, the LDS operations performed by different wavefronts of a
  work-group can be reordered relative to each other, which can result in
  reordering the visibility of vector memory operations with respect to LDS
  operations of other wavefronts in the same work-group. A ``s_waitcnt
  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
  vector memory operations between wavefronts of a work-group, but not between
  operations performed by the same wavefront.
* The vector memory operations are performed as wavefront wide operations and
  completion is reported to a wavefront in execution order. The exception is
  that ``flat_load/store/atomic`` instructions can report out of vector memory
  order if they access LDS memory, and out of LDS operation order if they
  access global memory.
* The vector memory operations access a single vector L1 cache shared by all
  SIMDs of a CU. Therefore:

  * No special action is required for coherence between the lanes of a single
    wavefront.
  * No special action is required for coherence between wavefronts in the same
    work-group since they execute on the same CU. The exception is when in
    tgsplit execution mode, as wavefronts of the same work-group can be in
    different CUs and so a ``buffer_inv sc0`` is required, which will
    invalidate the L1 cache.
  * A ``buffer_inv sc0`` is required to invalidate the L1 cache for coherence
    between wavefronts executing in different work-groups as they may be
    executing on different CUs.
  * Atomic read-modify-write instructions implicitly bypass the L1 cache.
    Therefore, they do not use the sc0 bit for coherence and instead use it to
    indicate if the instruction returns the original value being updated. They
    do use sc1 to indicate system or agent scope coherence.

* The scalar memory operations access a scalar L1 cache shared by all
  wavefronts on a group of CUs. The scalar and vector L1 caches are not
  coherent. However, scalar operations are used in a restricted way so do not
  impact the memory model. See :ref:`amdgpu-amdhsa-memory-spaces`.
* The vector and scalar memory operations use an L2 cache.

  * The gfx940 can be configured as a number of smaller agents with each
    having a single L2 shared by all CUs on the same agent, or as fewer
    (possibly one) larger agents with groups of CUs on each agent each sharing
    separate L2 caches.
  * The L2 cache has independent channels to service disjoint ranges of
    virtual addresses.
  * Each CU has a separate request queue per channel for its associated L2.
    Therefore, the vector and scalar memory operations performed by wavefronts
    executing with different L1 caches and the same L2 cache can be reordered
    relative to each other.
  * A ``s_waitcnt vmcnt(0)`` is required to ensure synchronization between
    vector memory operations of different CUs. It ensures a previous vector
    memory operation has completed before executing a subsequent vector memory
    or LDS operation and so can be used to meet the requirements of acquire
    and release.
  * An L2 cache can be kept coherent with other L2 caches by using the MTYPE
    RW (read-write) for memory local to the L2, and MTYPE NC (non-coherent)
    with the PTE C-bit set for memory not local to the L2.

    * Any local memory cache lines will be automatically invalidated by
      writes from CUs associated with other L2 caches, or writes from the
      CPU, due to the cache probe caused by the PTE C-bit.
    * XGMI accesses from the CPU to local memory may be cached on the CPU.
      Subsequent access from the GPU will automatically invalidate or
      writeback the CPU cache due to the L2 probe filter.
    * To ensure coherence of local memory writes of CUs with different L1
      caches in the same agent a ``buffer_wbl2`` is required. It does nothing
      if the agent is configured to have a single L2, or will writeback dirty
      L2 cache lines if configured to have multiple L2 caches.
    * To ensure coherence of local memory writes of CUs in different agents a
      ``buffer_wbl2 sc1`` is required. It will writeback dirty L2 cache
      lines.
    * To ensure coherence of local memory reads of CUs with different L1
      caches in the same agent a ``buffer_inv sc1`` is required. It does
      nothing if the agent is configured to have a single L2, or will
      invalidate non-local L2 cache lines if configured to have multiple L2
      caches.
    * To ensure coherence of local memory reads of CUs in different agents a
      ``buffer_inv sc0 sc1`` is required. It will invalidate non-local L2
      cache lines if configured to have multiple L2 caches.

  * PCIe access from the GPU to the CPU can be kept coherent by using the
    MTYPE UC (uncached) which bypasses the L2.

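
As an illustrative sketch of how the above rules compose (not normative; the
exact sequences, including the TgSplit and OpenCL variations, are defined by
the memory model code sequences table below), an agent-scope release store
paired with an agent-scope acquire load on GFX940 looks like:

.. code-block:: none

   ; Producer wavefront: agent scope release store.
   buffer_wbl2 sc1=1              ; writeback dirty L2 lines (multiple L2 caches)
   s_waitcnt vmcnt(0) lgkmcnt(0)  ; prior memory operations and writeback complete
   global_store sc1=1             ; the store being released

   ; Consumer wavefront: agent scope acquire load.
   global_load sc1=1              ; the load being acquired
   s_waitcnt vmcnt(0)             ; load completes before invalidating
   buffer_inv sc1=1               ; invalidate stale L1/non-local L2 lines so
                                  ; following loads do not see stale global data

Here ``global_store``/``global_load`` stand for any of the
``buffer/global/flat`` forms listed in the table.
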
8862 8863Scalar memory operations are only used to access memory that is proven to not 8864change during the execution of the kernel dispatch. This includes constant 8865address space and global address space for program scope ``const`` variables. 8866Therefore, the kernel machine code does not have to maintain the scalar cache to 8867ensure it is coherent with the vector caches. The scalar and vector caches are 8868invalidated between kernel dispatches by CP since constant address space data 8869may change between kernel dispatch executions. See 8870:ref:`amdgpu-amdhsa-memory-spaces`. 8871 8872The one exception is if scalar writes are used to spill SGPR registers. In this 8873case the AMDGPU backend ensures the memory location used to spill is never 8874accessed by vector memory operations at the same time. If scalar writes are used 8875then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function 8876return since the locations may be used for vector memory instructions by a 8877future wavefront that uses the same scratch area, or a function call that 8878creates a frame at the same address, respectively. There is no need for a 8879``s_dcache_inv`` as all scalar writes are write-before-read in the same thread. 8880 8881For kernarg backing memory: 8882 8883* CP invalidates the L1 cache at the start of each kernel dispatch. 8884* On dGPU over XGMI or PCIe the kernarg backing memory is allocated in host 8885 memory accessed as MTYPE UC (uncached) to avoid needing to invalidate the L2 8886 cache. This also causes it to be treated as non-volatile and so is not 8887 invalidated by ``*_vol``. 8888* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and 8889 so the L2 cache will be coherent with the CPU and other agents. 8890 8891Scratch backing memory (which is used for the private address space) is accessed 8892with MTYPE NC_NV (non-coherent non-volatile). 
Since the private address space is 8893only accessed by a single thread, and is always write-before-read, there is 8894never a need to invalidate these entries from the L1 cache. Hence all cache 8895invalidates are done as ``*_vol`` to only invalidate the volatile cache lines. 8896 8897The code sequences used to implement the memory model for GFX940 are defined 8898in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx940-table`. 8899 8900 .. table:: AMDHSA Memory Model Code Sequences GFX940 8901 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx940-table 8902 8903 ============ ============ ============== ========== ================================ 8904 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code 8905 Ordering Sync Scope Address GFX940 8906 Space 8907 ============ ============ ============== ========== ================================ 8908 **Non-Atomic** 8909 ------------------------------------------------------------------------------------ 8910 load *none* *none* - global - !volatile & !nontemporal 8911 - generic 8912 - private 1. buffer/global/flat_load 8913 - constant 8914 - !volatile & nontemporal 8915 8916 1. buffer/global/flat_load 8917 nt=1 8918 8919 - volatile 8920 8921 1. buffer/global/flat_load 8922 sc0=1 sc1=1 8923 2. s_waitcnt vmcnt(0) 8924 8925 - Must happen before 8926 any following volatile 8927 global/generic 8928 load/store. 8929 - Ensures that 8930 volatile 8931 operations to 8932 different 8933 addresses will not 8934 be reordered by 8935 hardware. 8936 8937 load *none* *none* - local 1. ds_load 8938 store *none* *none* - global - !volatile & !nontemporal 8939 - generic 8940 - private 1. buffer/global/flat_store 8941 - constant 8942 - !volatile & nontemporal 8943 8944 1. buffer/global/flat_store 8945 nt=1 8946 8947 - volatile 8948 8949 1. buffer/global/flat_store 8950 sc0=1 sc1=1 8951 2. s_waitcnt vmcnt(0) 8952 8953 - Must happen before 8954 any following volatile 8955 global/generic 8956 load/store. 
8957 - Ensures that 8958 volatile 8959 operations to 8960 different 8961 addresses will not 8962 be reordered by 8963 hardware. 8964 8965 store *none* *none* - local 1. ds_store 8966 **Unordered Atomic** 8967 ------------------------------------------------------------------------------------ 8968 load atomic unordered *any* *any* *Same as non-atomic*. 8969 store atomic unordered *any* *any* *Same as non-atomic*. 8970 atomicrmw unordered *any* *any* *Same as monotonic atomic*. 8971 **Monotonic Atomic** 8972 ------------------------------------------------------------------------------------ 8973 load atomic monotonic - singlethread - global 1. buffer/global/flat_load 8974 - wavefront - generic 8975 load atomic monotonic - workgroup - global 1. buffer/global/flat_load 8976 - generic sc0=1 8977 load atomic monotonic - singlethread - local *If TgSplit execution mode, 8978 - wavefront local address space cannot 8979 - workgroup be used.* 8980 8981 1. ds_load 8982 load atomic monotonic - agent - global 1. buffer/global/flat_load 8983 - generic sc1=1 8984 load atomic monotonic - system - global 1. buffer/global/flat_load 8985 - generic sc0=1 sc1=1 8986 store atomic monotonic - singlethread - global 1. buffer/global/flat_store 8987 - wavefront - generic 8988 store atomic monotonic - workgroup - global 1. buffer/global/flat_store 8989 - generic sc0=1 8990 store atomic monotonic - agent - global 1. buffer/global/flat_store 8991 - generic sc1=1 8992 store atomic monotonic - system - global 1. buffer/global/flat_store 8993 - generic sc0=1 sc1=1 8994 store atomic monotonic - singlethread - local *If TgSplit execution mode, 8995 - wavefront local address space cannot 8996 - workgroup be used.* 8997 8998 1. ds_store 8999 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic 9000 - wavefront - generic 9001 - workgroup 9002 - agent 9003 atomicrmw monotonic - system - global 1. 
buffer/global/flat_atomic 9004 - generic sc1=1 9005 atomicrmw monotonic - singlethread - local *If TgSplit execution mode, 9006 - wavefront local address space cannot 9007 - workgroup be used.* 9008 9009 1. ds_atomic 9010 **Acquire Atomic** 9011 ------------------------------------------------------------------------------------ 9012 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load 9013 - wavefront - local 9014 - generic 9015 load atomic acquire - workgroup - global 1. buffer/global_load sc0=1 9016 2. s_waitcnt vmcnt(0) 9017 9018 - If not TgSplit execution 9019 mode, omit. 9020 - Must happen before the 9021 following buffer_inv. 9022 9023 3. buffer_inv sc0=1 9024 9025 - If not TgSplit execution 9026 mode, omit. 9027 - Must happen before 9028 any following 9029 global/generic 9030 load/load 9031 atomic/store/store 9032 atomic/atomicrmw. 9033 - Ensures that 9034 following 9035 loads will not see 9036 stale data. 9037 9038 load atomic acquire - workgroup - local *If TgSplit execution mode, 9039 local address space cannot 9040 be used.* 9041 9042 1. ds_load 9043 2. s_waitcnt lgkmcnt(0) 9044 9045 - If OpenCL, omit. 9046 - Must happen before 9047 any following 9048 global/generic 9049 load/load 9050 atomic/store/store 9051 atomic/atomicrmw. 9052 - Ensures any 9053 following global 9054 data read is no 9055 older than the local load 9056 atomic value being 9057 acquired. 9058 9059 load atomic acquire - workgroup - generic 1. flat_load sc0=1 9060 2. s_waitcnt lgkm/vmcnt(0) 9061 9062 - Use lgkmcnt(0) if not 9063 TgSplit execution mode 9064 and vmcnt(0) if TgSplit 9065 execution mode. 9066 - If OpenCL, omit lgkmcnt(0). 9067 - Must happen before 9068 the following 9069 buffer_inv and any 9070 following global/generic 9071 load/load 9072 atomic/store/store 9073 atomic/atomicrmw. 9074 - Ensures any 9075 following global 9076 data read is no 9077 older than a local load 9078 atomic value being 9079 acquired. 9080 9081 3. 
buffer_inv sc0=1 9082 9083 - If not TgSplit execution 9084 mode, omit. 9085 - Ensures that 9086 following 9087 loads will not see 9088 stale data. 9089 9090 load atomic acquire - agent - global 1. buffer/global_load 9091 sc1=1 9092 2. s_waitcnt vmcnt(0) 9093 9094 - Must happen before 9095 following 9096 buffer_inv. 9097 - Ensures the load 9098 has completed 9099 before invalidating 9100 the cache. 9101 9102 3. buffer_inv sc1=1 9103 9104 - Must happen before 9105 any following 9106 global/generic 9107 load/load 9108 atomic/atomicrmw. 9109 - Ensures that 9110 following 9111 loads will not see 9112 stale global data. 9113 9114 load atomic acquire - system - global 1. buffer/global/flat_load 9115 sc0=1 sc1=1 9116 2. s_waitcnt vmcnt(0) 9117 9118 - Must happen before 9119 following 9120 buffer_inv. 9121 - Ensures the load 9122 has completed 9123 before invalidating 9124 the cache. 9125 9126 3. buffer_inv sc0=1 sc1=1 9127 9128 - Must happen before 9129 any following 9130 global/generic 9131 load/load 9132 atomic/atomicrmw. 9133 - Ensures that 9134 following 9135 loads will not see 9136 stale MTYPE NC global data. 9137 MTYPE RW and CC memory will 9138 never be stale due to the 9139 memory probes. 9140 9141 load atomic acquire - agent - generic 1. flat_load sc1=1 9142 2. s_waitcnt vmcnt(0) & 9143 lgkmcnt(0) 9144 9145 - If TgSplit execution mode, 9146 omit lgkmcnt(0). 9147 - If OpenCL omit 9148 lgkmcnt(0). 9149 - Must happen before 9150 following 9151 buffer_inv. 9152 - Ensures the flat_load 9153 has completed 9154 before invalidating 9155 the cache. 9156 9157 3. buffer_inv sc1=1 9158 9159 - Must happen before 9160 any following 9161 global/generic 9162 load/load 9163 atomic/atomicrmw. 9164 - Ensures that 9165 following loads 9166 will not see stale 9167 global data. 9168 9169 load atomic acquire - system - generic 1. flat_load sc0=1 sc1=1 9170 2. s_waitcnt vmcnt(0) & 9171 lgkmcnt(0) 9172 9173 - If TgSplit execution mode, 9174 omit lgkmcnt(0). 
9175 - If OpenCL omit 9176 lgkmcnt(0). 9177 - Must happen before 9178 the following 9179 buffer_inv. 9180 - Ensures the flat_load 9181 has completed 9182 before invalidating 9183 the caches. 9184 9185 3. buffer_inv sc0=1 sc1=1 9186 9187 - Must happen before 9188 any following 9189 global/generic 9190 load/load 9191 atomic/atomicrmw. 9192 - Ensures that 9193 following 9194 loads will not see 9195 stale MTYPE NC global data. 9196 MTYPE RW and CC memory will 9197 never be stale due to the 9198 memory probes. 9199 9200 atomicrmw acquire - singlethread - global 1. buffer/global/flat_atomic 9201 - wavefront - generic 9202 atomicrmw acquire - singlethread - local *If TgSplit execution mode, 9203 - wavefront local address space cannot 9204 be used.* 9205 9206 1. ds_atomic 9207 atomicrmw acquire - workgroup - global 1. buffer/global_atomic 9208 2. s_waitcnt vmcnt(0) 9209 9210 - If not TgSplit execution 9211 mode, omit. 9212 - Must happen before the 9213 following buffer_inv. 9214 - Ensures the atomicrmw 9215 has completed 9216 before invalidating 9217 the cache. 9218 9219 3. buffer_inv sc0=1 9220 9221 - If not TgSplit execution 9222 mode, omit. 9223 - Must happen before 9224 any following 9225 global/generic 9226 load/load 9227 atomic/atomicrmw. 9228 - Ensures that 9229 following loads 9230 will not see stale 9231 global data. 9232 9233 atomicrmw acquire - workgroup - local *If TgSplit execution mode, 9234 local address space cannot 9235 be used.* 9236 9237 1. ds_atomic 9238 2. s_waitcnt lgkmcnt(0) 9239 9240 - If OpenCL, omit. 9241 - Must happen before 9242 any following 9243 global/generic 9244 load/load 9245 atomic/store/store 9246 atomic/atomicrmw. 9247 - Ensures any 9248 following global 9249 data read is no 9250 older than the local 9251 atomicrmw value 9252 being acquired. 9253 9254 atomicrmw acquire - workgroup - generic 1. flat_atomic 9255 2. 
s_waitcnt lgkm/vmcnt(0) 9256 9257 - Use lgkmcnt(0) if not 9258 TgSplit execution mode 9259 and vmcnt(0) if TgSplit 9260 execution mode. 9261 - If OpenCL, omit lgkmcnt(0). 9262 - Must happen before 9263 the following 9264 buffer_inv and 9265 any following 9266 global/generic 9267 load/load 9268 atomic/store/store 9269 atomic/atomicrmw. 9270 - Ensures any 9271 following global 9272 data read is no 9273 older than a local 9274 atomicrmw value 9275 being acquired. 9276 9277 3. buffer_inv sc0=1 9278 9279 - If not TgSplit execution 9280 mode, omit. 9281 - Ensures that 9282 following 9283 loads will not see 9284 stale data. 9285 9286 atomicrmw acquire - agent - global 1. buffer/global_atomic 9287 2. s_waitcnt vmcnt(0) 9288 9289 - Must happen before 9290 following 9291 buffer_inv. 9292 - Ensures the 9293 atomicrmw has 9294 completed before 9295 invalidating the 9296 cache. 9297 9298 3. buffer_inv sc1=1 9299 9300 - Must happen before 9301 any following 9302 global/generic 9303 load/load 9304 atomic/atomicrmw. 9305 - Ensures that 9306 following loads 9307 will not see stale 9308 global data. 9309 9310 atomicrmw acquire - system - global 1. buffer/global_atomic 9311 sc1=1 9312 2. s_waitcnt vmcnt(0) 9313 9314 - Must happen before 9315 following 9316 buffer_inv. 9317 - Ensures the 9318 atomicrmw has 9319 completed before 9320 invalidating the 9321 caches. 9322 9323 3. buffer_inv sc0=1 sc1=1 9324 9325 - Must happen before 9326 any following 9327 global/generic 9328 load/load 9329 atomic/atomicrmw. 9330 - Ensures that 9331 following 9332 loads will not see 9333 stale MTYPE NC global data. 9334 MTYPE RW and CC memory will 9335 never be stale due to the 9336 memory probes. 9337 9338 atomicrmw acquire - agent - generic 1. flat_atomic 9339 2. s_waitcnt vmcnt(0) & 9340 lgkmcnt(0) 9341 9342 - If TgSplit execution mode, 9343 omit lgkmcnt(0). 9344 - If OpenCL, omit 9345 lgkmcnt(0). 9346 - Must happen before 9347 following 9348 buffer_inv. 
9349 - Ensures the 9350 atomicrmw has 9351 completed before 9352 invalidating the 9353 cache. 9354 9355 3. buffer_inv sc1=1 9356 9357 - Must happen before 9358 any following 9359 global/generic 9360 load/load 9361 atomic/atomicrmw. 9362 - Ensures that 9363 following loads 9364 will not see stale 9365 global data. 9366 9367 atomicrmw acquire - system - generic 1. flat_atomic sc1=1 9368 2. s_waitcnt vmcnt(0) & 9369 lgkmcnt(0) 9370 9371 - If TgSplit execution mode, 9372 omit lgkmcnt(0). 9373 - If OpenCL, omit 9374 lgkmcnt(0). 9375 - Must happen before 9376 following 9377 buffer_inv. 9378 - Ensures the 9379 atomicrmw has 9380 completed before 9381 invalidating the 9382 caches. 9383 9384 3. buffer_inv sc0=1 sc1=1 9385 9386 - Must happen before 9387 any following 9388 global/generic 9389 load/load 9390 atomic/atomicrmw. 9391 - Ensures that 9392 following 9393 loads will not see 9394 stale MTYPE NC global data. 9395 MTYPE RW and CC memory will 9396 never be stale due to the 9397 memory probes. 9398 9399 fence acquire - singlethread *none* *none* 9400 - wavefront 9401 fence acquire - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0) 9402 9403 - Use lgkmcnt(0) if not 9404 TgSplit execution mode 9405 and vmcnt(0) if TgSplit 9406 execution mode. 9407 - If OpenCL and 9408 address space is 9409 not generic, omit 9410 lgkmcnt(0). 9411 - If OpenCL and 9412 address space is 9413 local, omit 9414 vmcnt(0). 9415 - However, since LLVM 9416 currently has no 9417 address space on 9418 the fence need to 9419 conservatively 9420 always generate. If 9421 fence had an 9422 address space then 9423 set to address 9424 space of OpenCL 9425 fence flag, or to 9426 generic if both 9427 local and global 9428 flags are 9429 specified. 
9430 - s_waitcnt vmcnt(0) 9431 must happen after 9432 any preceding 9433 global/generic load 9434 atomic/ 9435 atomicrmw 9436 with an equal or 9437 wider sync scope 9438 and memory ordering 9439 stronger than 9440 unordered (this is 9441 termed the 9442 fence-paired-atomic). 9443 - s_waitcnt lgkmcnt(0) 9444 must happen after 9445 any preceding 9446 local/generic load 9447 atomic/atomicrmw 9448 with an equal or 9449 wider sync scope 9450 and memory ordering 9451 stronger than 9452 unordered (this is 9453 termed the 9454 fence-paired-atomic). 9455 - Must happen before 9456 the following 9457 buffer_inv and 9458 any following 9459 global/generic 9460 load/load 9461 atomic/store/store 9462 atomic/atomicrmw. 9463 - Ensures any 9464 following global 9465 data read is no 9466 older than the 9467 value read by the 9468 fence-paired-atomic. 9469 9470 3. buffer_inv sc0=1 9471 9472 - If not TgSplit execution 9473 mode, omit. 9474 - Ensures that 9475 following 9476 loads will not see 9477 stale data. 9478 9479 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) & 9480 vmcnt(0) 9481 9482 - If TgSplit execution mode, 9483 omit lgkmcnt(0). 9484 - If OpenCL and 9485 address space is 9486 not generic, omit 9487 lgkmcnt(0). 9488 - However, since LLVM 9489 currently has no 9490 address space on 9491 the fence need to 9492 conservatively 9493 always generate 9494 (see comment for 9495 previous fence). 9496 - Could be split into 9497 separate s_waitcnt 9498 vmcnt(0) and 9499 s_waitcnt 9500 lgkmcnt(0) to allow 9501 them to be 9502 independently moved 9503 according to the 9504 following rules. 9505 - s_waitcnt vmcnt(0) 9506 must happen after 9507 any preceding 9508 global/generic load 9509 atomic/atomicrmw 9510 with an equal or 9511 wider sync scope 9512 and memory ordering 9513 stronger than 9514 unordered (this is 9515 termed the 9516 fence-paired-atomic). 
9517 - s_waitcnt lgkmcnt(0) 9518 must happen after 9519 any preceding 9520 local/generic load 9521 atomic/atomicrmw 9522 with an equal or 9523 wider sync scope 9524 and memory ordering 9525 stronger than 9526 unordered (this is 9527 termed the 9528 fence-paired-atomic). 9529 - Must happen before 9530 the following 9531 buffer_inv. 9532 - Ensures that the 9533 fence-paired atomic 9534 has completed 9535 before invalidating 9536 the 9537 cache. Therefore 9538 any following 9539 locations read must 9540 be no older than 9541 the value read by 9542 the 9543 fence-paired-atomic. 9544 9545 2. buffer_inv sc1=1 9546 9547 - Must happen before any 9548 following global/generic 9549 load/load 9550 atomic/store/store 9551 atomic/atomicrmw. 9552 - Ensures that 9553 following loads 9554 will not see stale 9555 global data. 9556 9557 fence acquire - system *none* 1. s_waitcnt lgkmcnt(0) & 9558 vmcnt(0) 9559 9560 - If TgSplit execution mode, 9561 omit lgkmcnt(0). 9562 - If OpenCL and 9563 address space is 9564 not generic, omit 9565 lgkmcnt(0). 9566 - However, since LLVM 9567 currently has no 9568 address space on 9569 the fence need to 9570 conservatively 9571 always generate 9572 (see comment for 9573 previous fence). 9574 - Could be split into 9575 separate s_waitcnt 9576 vmcnt(0) and 9577 s_waitcnt 9578 lgkmcnt(0) to allow 9579 them to be 9580 independently moved 9581 according to the 9582 following rules. 9583 - s_waitcnt vmcnt(0) 9584 must happen after 9585 any preceding 9586 global/generic load 9587 atomic/atomicrmw 9588 with an equal or 9589 wider sync scope 9590 and memory ordering 9591 stronger than 9592 unordered (this is 9593 termed the 9594 fence-paired-atomic). 9595 - s_waitcnt lgkmcnt(0) 9596 must happen after 9597 any preceding 9598 local/generic load 9599 atomic/atomicrmw 9600 with an equal or 9601 wider sync scope 9602 and memory ordering 9603 stronger than 9604 unordered (this is 9605 termed the 9606 fence-paired-atomic). 
9607 - Must happen before 9608 the following 9609 buffer_inv. 9610 - Ensures that the 9611 fence-paired atomic 9612 has completed 9613 before invalidating 9614 the 9615 cache. Therefore 9616 any following 9617 locations read must 9618 be no older than 9619 the value read by 9620 the 9621 fence-paired-atomic. 9622 9623 2. buffer_inv sc0=1 sc1=1 9624 9625 - Must happen before any 9626 following global/generic 9627 load/load 9628 atomic/store/store 9629 atomic/atomicrmw. 9630 - Ensures that 9631 following loads 9632 will not see stale 9633 global data. 9634 9635 **Release Atomic** 9636 ------------------------------------------------------------------------------------ 9637 store atomic release - singlethread - global 1. buffer/global/flat_store 9638 - wavefront - generic 9639 store atomic release - singlethread - local *If TgSplit execution mode, 9640 - wavefront local address space cannot 9641 be used.* 9642 9643 1. ds_store 9644 store atomic release - workgroup - global 1. s_waitcnt lgkm/vmcnt(0) 9645 - generic 9646 - Use lgkmcnt(0) if not 9647 TgSplit execution mode 9648 and vmcnt(0) if TgSplit 9649 execution mode. 9650 - If OpenCL, omit lgkmcnt(0). 9651 - s_waitcnt vmcnt(0) 9652 must happen after 9653 any preceding 9654 global/generic load/store/ 9655 load atomic/store atomic/ 9656 atomicrmw. 9657 - s_waitcnt lgkmcnt(0) 9658 must happen after 9659 any preceding 9660 local/generic 9661 load/store/load 9662 atomic/store 9663 atomic/atomicrmw. 9664 - Must happen before 9665 the following 9666 store. 9667 - Ensures that all 9668 memory operations 9669 have 9670 completed before 9671 performing the 9672 store that is being 9673 released. 9674 9675 2. buffer/global/flat_store sc0=1 9676 store atomic release - workgroup - local *If TgSplit execution mode, 9677 local address space cannot 9678 be used.* 9679 9680 1. ds_store 9681 store atomic release - agent - global 1. buffer_wbl2 sc1=1 9682 - generic 9683 - Must happen before 9684 following s_waitcnt. 
9685 - Performs L2 writeback to 9686 ensure previous 9687 global/generic 9688 store/atomicrmw are 9689 visible at agent scope. 9690 9691 2. s_waitcnt lgkmcnt(0) & 9692 vmcnt(0) 9693 9694 - If TgSplit execution mode, 9695 omit lgkmcnt(0). 9696 - If OpenCL and 9697 address space is 9698 not generic, omit 9699 lgkmcnt(0). 9700 - Could be split into 9701 separate s_waitcnt 9702 vmcnt(0) and 9703 s_waitcnt 9704 lgkmcnt(0) to allow 9705 them to be 9706 independently moved 9707 according to the 9708 following rules. 9709 - s_waitcnt vmcnt(0) 9710 must happen after 9711 any preceding 9712 global/generic 9713 load/store/load 9714 atomic/store 9715 atomic/atomicrmw. 9716 - s_waitcnt lgkmcnt(0) 9717 must happen after 9718 any preceding 9719 local/generic 9720 load/store/load 9721 atomic/store 9722 atomic/atomicrmw. 9723 - Must happen before 9724 the following 9725 store. 9726 - Ensures that all 9727 memory operations 9728 to memory have 9729 completed before 9730 performing the 9731 store that is being 9732 released. 9733 9734 3. buffer/global/flat_store sc1=1 9735 store atomic release - system - global 1. buffer_wbl2 sc0=1 sc1=1 9736 - generic 9737 - Must happen before 9738 following s_waitcnt. 9739 - Performs L2 writeback to 9740 ensure previous 9741 global/generic 9742 store/atomicrmw are 9743 visible at system scope. 9744 9745 2. s_waitcnt lgkmcnt(0) & 9746 vmcnt(0) 9747 9748 - If TgSplit execution mode, 9749 omit lgkmcnt(0). 9750 - If OpenCL and 9751 address space is 9752 not generic, omit 9753 lgkmcnt(0). 9754 - Could be split into 9755 separate s_waitcnt 9756 vmcnt(0) and 9757 s_waitcnt 9758 lgkmcnt(0) to allow 9759 them to be 9760 independently moved 9761 according to the 9762 following rules. 9763 - s_waitcnt vmcnt(0) 9764 must happen after any 9765 preceding 9766 global/generic 9767 load/store/load 9768 atomic/store 9769 atomic/atomicrmw. 
9770 - s_waitcnt lgkmcnt(0) 9771 must happen after any 9772 preceding 9773 local/generic 9774 load/store/load 9775 atomic/store 9776 atomic/atomicrmw. 9777 - Must happen before 9778 the following 9779 store. 9780 - Ensures that all 9781 memory operations 9782 to memory and the L2 9783 writeback have 9784 completed before 9785 performing the 9786 store that is being 9787 released. 9788 9789 3. buffer/global/flat_store 9790 sc0=1 sc1=1 9791 atomicrmw release - singlethread - global 1. buffer/global/flat_atomic 9792 - wavefront - generic 9793 atomicrmw release - singlethread - local *If TgSplit execution mode, 9794 - wavefront local address space cannot 9795 be used.* 9796 9797 1. ds_atomic 9798 atomicrmw release - workgroup - global 1. s_waitcnt lgkm/vmcnt(0) 9799 - generic 9800 - Use lgkmcnt(0) if not 9801 TgSplit execution mode 9802 and vmcnt(0) if TgSplit 9803 execution mode. 9804 - If OpenCL, omit 9805 lgkmcnt(0). 9806 - s_waitcnt vmcnt(0) 9807 must happen after 9808 any preceding 9809 global/generic load/store/ 9810 load atomic/store atomic/ 9811 atomicrmw. 9812 - s_waitcnt lgkmcnt(0) 9813 must happen after 9814 any preceding 9815 local/generic 9816 load/store/load 9817 atomic/store 9818 atomic/atomicrmw. 9819 - Must happen before 9820 the following 9821 atomicrmw. 9822 - Ensures that all 9823 memory operations 9824 have 9825 completed before 9826 performing the 9827 atomicrmw that is 9828 being released. 9829 9830 2. buffer/global/flat_atomic sc0=1 9831 atomicrmw release - workgroup - local *If TgSplit execution mode, 9832 local address space cannot 9833 be used.* 9834 9835 1. ds_atomic 9836 atomicrmw release - agent - global 1. buffer_wbl2 sc1=1 9837 - generic 9838 - Must happen before 9839 following s_waitcnt. 9840 - Performs L2 writeback to 9841 ensure previous 9842 global/generic 9843 store/atomicrmw are 9844 visible at agent scope. 9845 9846 2. s_waitcnt lgkmcnt(0) & 9847 vmcnt(0) 9848 9849 - If TgSplit execution mode, 9850 omit lgkmcnt(0). 
9851 - If OpenCL, omit 9852 lgkmcnt(0). 9853 - Could be split into 9854 separate s_waitcnt 9855 vmcnt(0) and 9856 s_waitcnt 9857 lgkmcnt(0) to allow 9858 them to be 9859 independently moved 9860 according to the 9861 following rules. 9862 - s_waitcnt vmcnt(0) 9863 must happen after 9864 any preceding 9865 global/generic 9866 load/store/load 9867 atomic/store 9868 atomic/atomicrmw. 9869 - s_waitcnt lgkmcnt(0) 9870 must happen after 9871 any preceding 9872 local/generic 9873 load/store/load 9874 atomic/store 9875 atomic/atomicrmw. 9876 - Must happen before 9877 the following 9878 atomicrmw. 9879 - Ensures that all 9880 memory operations 9881 to global and local 9882 have completed 9883 before performing 9884 the atomicrmw that 9885 is being released. 9886 9887 3. buffer/global/flat_atomic sc1=1 9888 atomicrmw release - system - global 1. buffer_wbl2 sc0=1 sc1=1 9889 - generic 9890 - Must happen before 9891 following s_waitcnt. 9892 - Performs L2 writeback to 9893 ensure previous 9894 global/generic 9895 store/atomicrmw are 9896 visible at system scope. 9897 9898 2. s_waitcnt lgkmcnt(0) & 9899 vmcnt(0) 9900 9901 - If TgSplit execution mode, 9902 omit lgkmcnt(0). 9903 - If OpenCL, omit 9904 lgkmcnt(0). 9905 - Could be split into 9906 separate s_waitcnt 9907 vmcnt(0) and 9908 s_waitcnt 9909 lgkmcnt(0) to allow 9910 them to be 9911 independently moved 9912 according to the 9913 following rules. 9914 - s_waitcnt vmcnt(0) 9915 must happen after 9916 any preceding 9917 global/generic 9918 load/store/load 9919 atomic/store 9920 atomic/atomicrmw. 9921 - s_waitcnt lgkmcnt(0) 9922 must happen after 9923 any preceding 9924 local/generic 9925 load/store/load 9926 atomic/store 9927 atomic/atomicrmw. 9928 - Must happen before 9929 the following 9930 atomicrmw. 9931 - Ensures that all 9932 memory operations 9933 to memory and the L2 9934 writeback have 9935 completed before 9936 performing the 9937 store that is being 9938 released. 9939 9940 3. 
buffer/global/flat_atomic 9941 sc0=1 sc1=1 9942 fence release - singlethread *none* *none* 9943 - wavefront 9944 fence release - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0) 9945 9946 - Use lgkmcnt(0) if not 9947 TgSplit execution mode 9948 and vmcnt(0) if TgSplit 9949 execution mode. 9950 - If OpenCL and 9951 address space is 9952 not generic, omit 9953 lgkmcnt(0). 9954 - If OpenCL and 9955 address space is 9956 local, omit 9957 vmcnt(0). 9958 - However, since LLVM 9959 currently has no 9960 address space on 9961 the fence need to 9962 conservatively 9963 always generate. If 9964 fence had an 9965 address space then 9966 set to address 9967 space of OpenCL 9968 fence flag, or to 9969 generic if both 9970 local and global 9971 flags are 9972 specified. 9973 - s_waitcnt vmcnt(0) 9974 must happen after 9975 any preceding 9976 global/generic 9977 load/store/ 9978 load atomic/store atomic/ 9979 atomicrmw. 9980 - s_waitcnt lgkmcnt(0) 9981 must happen after 9982 any preceding 9983 local/generic 9984 load/load 9985 atomic/store/store 9986 atomic/atomicrmw. 9987 - Must happen before 9988 any following store 9989 atomic/atomicrmw 9990 with an equal or 9991 wider sync scope 9992 and memory ordering 9993 stronger than 9994 unordered (this is 9995 termed the 9996 fence-paired-atomic). 9997 - Ensures that all 9998 memory operations 9999 have 10000 completed before 10001 performing the 10002 following 10003 fence-paired-atomic. 10004 10005 fence release - agent *none* 1. buffer_wbl2 sc1=1 10006 10007 - If OpenCL and 10008 address space is 10009 local, omit. 10010 - Must happen before 10011 following s_waitcnt. 10012 - Performs L2 writeback to 10013 ensure previous 10014 global/generic 10015 store/atomicrmw are 10016 visible at agent scope. 10017 10018 2. s_waitcnt lgkmcnt(0) & 10019 vmcnt(0) 10020 10021 - If TgSplit execution mode, 10022 omit lgkmcnt(0). 10023 - If OpenCL and 10024 address space is 10025 not generic, omit 10026 lgkmcnt(0). 
10027 - If OpenCL and 10028 address space is 10029 local, omit 10030 vmcnt(0). 10031 - However, since LLVM 10032 currently has no 10033 address space on 10034 the fence need to 10035 conservatively 10036 always generate. If 10037 fence had an 10038 address space then 10039 set to address 10040 space of OpenCL 10041 fence flag, or to 10042 generic if both 10043 local and global 10044 flags are 10045 specified. 10046 - Could be split into 10047 separate s_waitcnt 10048 vmcnt(0) and 10049 s_waitcnt 10050 lgkmcnt(0) to allow 10051 them to be 10052 independently moved 10053 according to the 10054 following rules. 10055 - s_waitcnt vmcnt(0) 10056 must happen after 10057 any preceding 10058 global/generic 10059 load/store/load 10060 atomic/store 10061 atomic/atomicrmw. 10062 - s_waitcnt lgkmcnt(0) 10063 must happen after 10064 any preceding 10065 local/generic 10066 load/store/load 10067 atomic/store 10068 atomic/atomicrmw. 10069 - Must happen before 10070 any following store 10071 atomic/atomicrmw 10072 with an equal or 10073 wider sync scope 10074 and memory ordering 10075 stronger than 10076 unordered (this is 10077 termed the 10078 fence-paired-atomic). 10079 - Ensures that all 10080 memory operations 10081 have 10082 completed before 10083 performing the 10084 following 10085 fence-paired-atomic. 10086 10087 fence release - system *none* 1. buffer_wbl2 sc0=1 sc1=1 10088 10089 - Must happen before 10090 following s_waitcnt. 10091 - Performs L2 writeback to 10092 ensure previous 10093 global/generic 10094 store/atomicrmw are 10095 visible at system scope. 10096 10097 2. s_waitcnt lgkmcnt(0) & 10098 vmcnt(0) 10099 10100 - If TgSplit execution mode, 10101 omit lgkmcnt(0). 10102 - If OpenCL and 10103 address space is 10104 not generic, omit 10105 lgkmcnt(0). 10106 - If OpenCL and 10107 address space is 10108 local, omit 10109 vmcnt(0). 
                                                            - However, since LLVM currently has no address space on the fence, it needs to be conservatively always generated. If the fence had an address space, it would be set to the address space of the OpenCL fence flag, or to generic if both the local and global flags are specified.
                                                            - Could be split into separate s_waitcnt vmcnt(0) and s_waitcnt lgkmcnt(0) to allow them to be independently moved according to the following rules.
                                                            - s_waitcnt vmcnt(0) must happen after any preceding global/generic load/store/load atomic/store atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0) must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
                                                            - Must happen before any following store atomic/atomicrmw with an equal or wider sync scope and memory ordering stronger than unordered (this is termed the fence-paired-atomic).
                                                            - Ensures that all memory operations have completed before performing the following fence-paired-atomic.

     **Acquire-Release Atomic**
     ------------------------------------------------------------------------------------
     atomicrmw    acq_rel      - singlethread - global   1. buffer/global/flat_atomic
                               - wavefront    - generic
     atomicrmw    acq_rel      - singlethread - local    *If TgSplit execution mode, local address space cannot be used.*
                               - wavefront
                                                         1. ds_atomic
     atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)

                                                            - Use lgkmcnt(0) if not TgSplit execution mode and vmcnt(0) if TgSplit execution mode.
                                                            - If OpenCL, omit lgkmcnt(0).
                                                            - Must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
                                                            - s_waitcnt vmcnt(0) must happen after any preceding global/generic load/store/load atomic/store atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0) must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
                                                            - Must happen before the following atomicrmw.
                                                            - Ensures that all memory operations have completed before performing the atomicrmw that is being released.

                                                         2. buffer/global_atomic
                                                         3. s_waitcnt vmcnt(0)

                                                            - If not TgSplit execution mode, omit.
                                                            - Must happen before the following buffer_inv.
                                                            - Ensures any following global data read is no older than the atomicrmw value being acquired.

                                                         4. buffer_inv sc0=1

                                                            - If not TgSplit execution mode, omit.
                                                            - Ensures that following loads will not see stale data.

     atomicrmw    acq_rel      - workgroup    - local    *If TgSplit execution mode, local address space cannot be used.*

                                                         1. ds_atomic
                                                         2. s_waitcnt lgkmcnt(0)

                                                            - If OpenCL, omit.
                                                            - Must happen before any following global/generic load/load atomic/store/store atomic/atomicrmw.
                                                            - Ensures any following global data read is no older than the local load atomic value being acquired.

     atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkm/vmcnt(0)

                                                            - Use lgkmcnt(0) if not TgSplit execution mode and vmcnt(0) if TgSplit execution mode.
                                                            - If OpenCL, omit lgkmcnt(0).
                                                            - s_waitcnt vmcnt(0) must happen after any preceding global/generic load/store/load atomic/store atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0) must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
                                                            - Must happen before the following atomicrmw.
                                                            - Ensures that all memory operations have completed before performing the atomicrmw that is being released.

                                                         2. flat_atomic
                                                         3. s_waitcnt lgkmcnt(0) & vmcnt(0)

                                                            - If not TgSplit execution mode, omit vmcnt(0).
                                                            - If OpenCL, omit lgkmcnt(0).
                                                            - Must happen before the following buffer_inv and any following global/generic load/load atomic/store/store atomic/atomicrmw.
                                                            - Ensures any following global data read is no older than a local load atomic value being acquired.

                                                         4. buffer_inv sc0=1

                                                            - If not TgSplit execution mode, omit.
                                                            - Ensures that following loads will not see stale data.

     atomicrmw    acq_rel      - agent        - global   1. buffer_wbl2 sc1=1

                                                            - Must happen before following s_waitcnt.
                                                            - Performs L2 writeback to ensure previous global/generic store/atomicrmw are visible at agent scope.

                                                         2. s_waitcnt lgkmcnt(0) & vmcnt(0)

                                                            - If TgSplit execution mode, omit lgkmcnt(0).
                                                            - If OpenCL, omit lgkmcnt(0).
                                                            - Could be split into separate s_waitcnt vmcnt(0) and s_waitcnt lgkmcnt(0) to allow them to be independently moved according to the following rules.
                                                            - s_waitcnt vmcnt(0) must happen after any preceding global/generic load/store/load atomic/store atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0) must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
                                                            - Must happen before the following atomicrmw.
                                                            - Ensures that all memory operations to global have completed before performing the atomicrmw that is being released.

                                                         3. buffer/global_atomic
                                                         4. s_waitcnt vmcnt(0)

                                                            - Must happen before following buffer_inv.
                                                            - Ensures the atomicrmw has completed before invalidating the cache.

                                                         5. buffer_inv sc1=1

                                                            - Must happen before any following global/generic load/load atomic/atomicrmw.
                                                            - Ensures that following loads will not see stale global data.

     atomicrmw    acq_rel      - system       - global   1. buffer_wbl2 sc0=1 sc1=1

                                                            - Must happen before following s_waitcnt.
                                                            - Performs L2 writeback to ensure previous global/generic store/atomicrmw are visible at system scope.

                                                         2. s_waitcnt lgkmcnt(0) & vmcnt(0)

                                                            - If TgSplit execution mode, omit lgkmcnt(0).
                                                            - If OpenCL, omit lgkmcnt(0).
                                                            - Could be split into separate s_waitcnt vmcnt(0) and s_waitcnt lgkmcnt(0) to allow them to be independently moved according to the following rules.
                                                            - s_waitcnt vmcnt(0) must happen after any preceding global/generic load/store/load atomic/store atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0) must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
                                                            - Must happen before the following atomicrmw.
                                                            - Ensures that all memory operations to global and L2 writeback have completed before performing the atomicrmw that is being released.

                                                         3. buffer/global_atomic sc1=1
                                                         4.
                                                            s_waitcnt vmcnt(0)

                                                            - Must happen before following buffer_inv.
                                                            - Ensures the atomicrmw has completed before invalidating the caches.

                                                         5. buffer_inv sc0=1 sc1=1

                                                            - Must happen before any following global/generic load/load atomic/atomicrmw.
                                                            - Ensures that following loads will not see stale MTYPE NC global data. MTYPE RW and CC memory will never be stale due to the memory probes.

     atomicrmw    acq_rel      - agent        - generic  1. buffer_wbl2 sc1=1

                                                            - Must happen before following s_waitcnt.
                                                            - Performs L2 writeback to ensure previous global/generic store/atomicrmw are visible at agent scope.

                                                         2. s_waitcnt lgkmcnt(0) & vmcnt(0)

                                                            - If TgSplit execution mode, omit lgkmcnt(0).
                                                            - If OpenCL, omit lgkmcnt(0).
                                                            - Could be split into separate s_waitcnt vmcnt(0) and s_waitcnt lgkmcnt(0) to allow them to be independently moved according to the following rules.
                                                            - s_waitcnt vmcnt(0) must happen after any preceding global/generic load/store/load atomic/store atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0) must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
                                                            - Must happen before the following atomicrmw.
                                                            - Ensures that all memory operations to global have completed before performing the atomicrmw that is being released.

                                                         3. flat_atomic
                                                         4. s_waitcnt vmcnt(0) & lgkmcnt(0)

                                                            - If TgSplit execution mode, omit lgkmcnt(0).
                                                            - If OpenCL, omit lgkmcnt(0).
                                                            - Must happen before following buffer_inv.
                                                            - Ensures the atomicrmw has completed before invalidating the cache.

                                                         5. buffer_inv sc1=1

                                                            - Must happen before any following global/generic load/load atomic/atomicrmw.
                                                            - Ensures that following loads will not see stale global data.

     atomicrmw    acq_rel      - system       - generic  1. buffer_wbl2 sc0=1 sc1=1

                                                            - Must happen before following s_waitcnt.
                                                            - Performs L2 writeback to ensure previous global/generic store/atomicrmw are visible at system scope.

                                                         2. s_waitcnt lgkmcnt(0) & vmcnt(0)

                                                            - If TgSplit execution mode, omit lgkmcnt(0).
                                                            - If OpenCL, omit lgkmcnt(0).
                                                            - Could be split into separate s_waitcnt vmcnt(0) and s_waitcnt lgkmcnt(0) to allow them to be independently moved according to the following rules.
                                                            - s_waitcnt vmcnt(0) must happen after any preceding global/generic load/store/load atomic/store atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0) must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
                                                            - Must happen before the following atomicrmw.
                                                            - Ensures that all memory operations to global and L2 writeback have completed before performing the atomicrmw that is being released.

                                                         3. flat_atomic sc1=1
                                                         4. s_waitcnt vmcnt(0) & lgkmcnt(0)

                                                            - If TgSplit execution mode, omit lgkmcnt(0).
                                                            - If OpenCL, omit lgkmcnt(0).
                                                            - Must happen before following buffer_inv.
                                                            - Ensures the atomicrmw has completed before invalidating the caches.

                                                         5.
                                                            buffer_inv sc0=1 sc1=1

                                                            - Must happen before any following global/generic load/load atomic/atomicrmw.
                                                            - Ensures that following loads will not see stale MTYPE NC global data. MTYPE RW and CC memory will never be stale due to the memory probes.

     fence        acq_rel      - singlethread *none*     *none*
                               - wavefront
     fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)

                                                            - Use lgkmcnt(0) if not TgSplit execution mode and vmcnt(0) if TgSplit execution mode.
                                                            - If OpenCL and address space is not generic, omit lgkmcnt(0).
                                                            - If OpenCL and address space is local, omit vmcnt(0).
                                                            - However, since LLVM currently has no address space on the fence, it needs to be conservatively always generated (see comment for previous fence).
                                                            - s_waitcnt vmcnt(0) must happen after any preceding global/generic load/store/load atomic/store atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0) must happen after any preceding local/generic load/load atomic/store/store atomic/atomicrmw.
                                                            - Must happen before any following global/generic load/load atomic/store/store atomic/atomicrmw.
                                                            - Ensures that all memory operations have completed before performing any following global memory operations.
                                                            - Ensures that the preceding local/generic load atomic/atomicrmw with an equal or wider sync scope and memory ordering stronger than unordered (this is termed the acquire-fence-paired-atomic) has completed before following global memory operations. This satisfies the requirements of acquire.
                                                            - Ensures that all previous memory operations have completed before a following local/generic store atomic/atomicrmw with an equal or wider sync scope and memory ordering stronger than unordered (this is termed the release-fence-paired-atomic). This satisfies the requirements of release.
                                                            - Must happen before the following buffer_inv.
                                                            - Ensures that the acquire-fence-paired atomic has completed before invalidating the cache. Therefore any following locations read must be no older than the value read by the acquire-fence-paired-atomic.

                                                         2. buffer_inv sc0=1

                                                            - If not TgSplit execution mode, omit.
                                                            - Ensures that following loads will not see stale data.

     fence        acq_rel      - agent        *none*     1. buffer_wbl2 sc1=1

                                                            - If OpenCL and address space is local, omit.
                                                            - Must happen before following s_waitcnt.
                                                            - Performs L2 writeback to ensure previous global/generic store/atomicrmw are visible at agent scope.

                                                         2. s_waitcnt lgkmcnt(0) & vmcnt(0)

                                                            - If TgSplit execution mode, omit lgkmcnt(0).
                                                            - If OpenCL and address space is not generic, omit lgkmcnt(0).
                                                            - However, since LLVM currently has no address space on the fence, it needs to be conservatively always generated (see comment for previous fence).
                                                            - Could be split into separate s_waitcnt vmcnt(0) and s_waitcnt lgkmcnt(0) to allow them to be independently moved according to the following rules.
                                                            - s_waitcnt vmcnt(0) must happen after any preceding global/generic load/store/load atomic/store atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0) must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
                                                            - Must happen before the following buffer_inv.
                                                            - Ensures that the preceding global/local/generic load atomic/atomicrmw with an equal or wider sync scope and memory ordering stronger than unordered (this is termed the acquire-fence-paired-atomic) has completed before invalidating the cache. This satisfies the requirements of acquire.
                                                            - Ensures that all previous memory operations have completed before a following global/local/generic store atomic/atomicrmw with an equal or wider sync scope and memory ordering stronger than unordered (this is termed the release-fence-paired-atomic). This satisfies the requirements of release.

                                                         3. buffer_inv sc1=1

                                                            - Must happen before any following global/generic load/load atomic/store/store atomic/atomicrmw.
                                                            - Ensures that following loads will not see stale global data. This satisfies the requirements of acquire.

     fence        acq_rel      - system       *none*     1. buffer_wbl2 sc0=1 sc1=1

                                                            - If OpenCL and address space is local, omit.
                                                            - Must happen before following s_waitcnt.
                                                            - Performs L2 writeback to ensure previous global/generic store/atomicrmw are visible at system scope.

                                                         2. s_waitcnt lgkmcnt(0) & vmcnt(0)

                                                            - If TgSplit execution mode, omit lgkmcnt(0).
                                                            - If OpenCL and address space is not generic, omit lgkmcnt(0).
                                                            - However, since LLVM currently has no address space on the fence, it needs to be conservatively always generated (see comment for previous fence).
                                                            - Could be split into separate s_waitcnt vmcnt(0) and s_waitcnt lgkmcnt(0) to allow them to be independently moved according to the following rules.
                                                            - s_waitcnt vmcnt(0) must happen after any preceding global/generic load/store/load atomic/store atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0) must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
                                                            - Must happen before the following buffer_inv.
                                                            - Ensures that the preceding global/local/generic load atomic/atomicrmw with an equal or wider sync scope and memory ordering stronger than unordered (this is termed the acquire-fence-paired-atomic) has completed before invalidating the cache. This satisfies the requirements of acquire.
                                                            - Ensures that all previous memory operations have completed before a following global/local/generic store atomic/atomicrmw with an equal or wider sync scope and memory ordering stronger than unordered (this is termed the release-fence-paired-atomic). This satisfies the requirements of release.

                                                         3. buffer_inv sc0=1 sc1=1

                                                            - Must happen before any following global/generic load/load atomic/store/store atomic/atomicrmw.
                                                            - Ensures that following loads will not see stale MTYPE NC global data. MTYPE RW and CC memory will never be stale due to the memory probes.

     **Sequential Consistent Atomic**
     ------------------------------------------------------------------------------------
     load atomic  seq_cst      - singlethread - global   *Same as corresponding load atomic acquire, except must generate all instructions even for OpenCL.*
                               - wavefront    - local
                                              - generic
     load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
                                              - generic
                                                            - Use lgkmcnt(0) if not TgSplit execution mode and vmcnt(0) if TgSplit execution mode.
                                                            - s_waitcnt lgkmcnt(0) must happen after preceding local/generic load atomic/store atomic/atomicrmw with memory ordering of seq_cst and with equal or wider sync scope. (Note that seq_cst fences have their own s_waitcnt lgkmcnt(0) and so do not need to be considered.)
                                                            - s_waitcnt vmcnt(0) must happen after preceding global/generic load atomic/store atomic/atomicrmw with memory ordering of seq_cst and with equal or wider sync scope. (Note that seq_cst fences have their own s_waitcnt vmcnt(0) and so do not need to be considered.)
                                                            - Ensures any preceding sequential consistent global/local memory instructions have completed before executing this sequentially consistent instruction. This prevents reordering a seq_cst store followed by a seq_cst load. (Note that seq_cst is stronger than acquire/release as the reordering of load acquire followed by a store release is prevented by the s_waitcnt of the release, but there is nothing preventing a store release followed by load acquire from completing out of order.
                                                              The s_waitcnt could be placed after the seq_cst store or before the seq_cst load. We choose the load to make the s_waitcnt be as late as possible so that the store may have already completed.)

                                                         2. *Following instructions same as corresponding load atomic acquire, except must generate all instructions even for OpenCL.*

     load atomic  seq_cst      - workgroup    - local    *If TgSplit execution mode, local address space cannot be used.*

                                                         *Same as corresponding load atomic acquire, except must generate all instructions even for OpenCL.*

     load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) & vmcnt(0)
                               - system       - generic
                                                            - If TgSplit execution mode, omit lgkmcnt(0).
                                                            - Could be split into separate s_waitcnt vmcnt(0) and s_waitcnt lgkmcnt(0) to allow them to be independently moved according to the following rules.
                                                            - s_waitcnt lgkmcnt(0) must happen after preceding global/generic load atomic/store atomic/atomicrmw with memory ordering of seq_cst and with equal or wider sync scope. (Note that seq_cst fences have their own s_waitcnt lgkmcnt(0) and so do not need to be considered.)
                                                            - s_waitcnt vmcnt(0) must happen after preceding global/generic load atomic/store atomic/atomicrmw with memory ordering of seq_cst and with equal or wider sync scope. (Note that seq_cst fences have their own s_waitcnt vmcnt(0) and so do not need to be considered.)
                                                            - Ensures any preceding sequential consistent global memory instructions have completed before executing this sequentially consistent instruction.
                                                              This prevents reordering a seq_cst store followed by a seq_cst load. (Note that seq_cst is stronger than acquire/release as the reordering of load acquire followed by a store release is prevented by the s_waitcnt of the release, but there is nothing preventing a store release followed by load acquire from completing out of order. The s_waitcnt could be placed after the seq_cst store or before the seq_cst load. We choose the load to make the s_waitcnt be as late as possible so that the store may have already completed.)

                                                         2. *Following instructions same as corresponding load atomic acquire, except must generate all instructions even for OpenCL.*

     store atomic seq_cst      - singlethread - global   *Same as corresponding store atomic release, except must generate all instructions even for OpenCL.*
                               - wavefront    - local
                               - workgroup    - generic
                               - agent
                               - system
     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding atomicrmw acq_rel, except must generate all instructions even for OpenCL.*
                               - wavefront    - local
                               - workgroup    - generic
                               - agent
                               - system
     fence        seq_cst      - singlethread *none*     *Same as corresponding fence acq_rel, except must generate all instructions even for OpenCL.*
                               - wavefront
                               - workgroup
                               - agent
                               - system
     ============ ============ ============== ========== ================================

.. _amdgpu-amdhsa-memory-model-gfx10-gfx11:

Memory Model GFX10-GFX11
++++++++++++++++++++++++

For GFX10-GFX11:

* Each agent has multiple shader arrays (SA).
* Each SA has multiple work-group processors (WGP).
* Each WGP has multiple compute units (CU).
* Each CU has multiple SIMDs that execute wavefronts.
* The wavefronts for a single work-group are executed in the same
  WGP. In CU wavefront execution mode the wavefronts may be executed by
  different SIMDs in the same CU. In WGP wavefront execution mode the
  wavefronts may be executed by different SIMDs in different CUs in the same
  WGP.
* Each WGP has a single LDS memory shared by the wavefronts of the work-groups
  executing on it.
* All LDS operations of a WGP are performed as wavefront wide operations in a
  global order and involve no caching. Completion is reported to a wavefront in
  execution order.
* The LDS memory has multiple request queues shared by the SIMDs of a
  WGP. Therefore, the LDS operations performed by different wavefronts of a
  work-group can be reordered relative to each other, which can result in
  reordering the visibility of vector memory operations with respect to LDS
  operations of other wavefronts in the same work-group. A ``s_waitcnt
  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
  vector memory operations between wavefronts of a work-group, but not between
  operations performed by the same wavefront.
* The vector memory operations are performed as wavefront wide operations.
  Completion of load/store/sample operations is reported to a wavefront in
  execution order of other load/store/sample operations performed by that
  wavefront.
* The vector memory operations access a vector L0 cache. There is a single L0
  cache per CU. Each SIMD of a CU accesses the same L0 cache. Therefore, no
  special action is required for coherence between the lanes of a single
  wavefront. However, a ``buffer_gl0_inv`` is required for coherence between
  wavefronts executing in the same work-group as they may be executing on SIMDs
  of different CUs that access different L0s.
  A ``buffer_gl0_inv`` is also
  required for coherence between wavefronts executing in different work-groups
  as they may be executing on different WGPs.
* The scalar memory operations access a scalar L0 cache shared by all wavefronts
  on a WGP. The scalar and vector L0 caches are not coherent. However, scalar
  operations are used in a restricted way so do not impact the memory model. See
  :ref:`amdgpu-amdhsa-memory-spaces`.
* The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on
  the same SA. Therefore, no special action is required for coherence between
  the wavefronts of a single work-group. However, a ``buffer_gl1_inv`` is
  required for coherence between wavefronts executing in different work-groups
  as they may be executing on different SAs that access different L1s.
* The L1 caches have independent quadrants to service disjoint ranges of virtual
  addresses.
* Each L0 cache has a separate request queue per L1 quadrant. Therefore, the
  vector and scalar memory operations performed by different wavefronts, whether
  executing in the same or different work-groups (which may be executing on
  different CUs accessing different L0s), can be reordered relative to each
  other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is required to ensure
  synchronization between vector memory operations of different wavefronts. It
  ensures a previous vector memory operation has completed before executing a
  subsequent vector memory or LDS operation and so can be used to meet the
  requirements of acquire, release and sequential consistency.
* The L1 caches use an L2 cache shared by all SAs on the same agent.
* The L2 cache has independent channels to service disjoint ranges of virtual
  addresses.
* Each L1 quadrant of a single SA accesses a different L2 channel. Each L1
  quadrant has a separate request queue per L2 channel.
  Therefore, the vector
  and scalar memory operations performed by wavefronts executing in different
  work-groups (which may be executing on different SAs) of an agent can be
  reordered relative to each other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is
  required to ensure synchronization between vector memory operations of
  different SAs. It ensures a previous vector memory operation has completed
  before executing a subsequent vector memory operation and so can be used to
  meet the requirements of acquire, release and sequential consistency.
* The L2 cache can be kept coherent with other agents on some targets, or ranges
  of virtual addresses can be set up to bypass it to ensure system coherence.
* On GFX10.3 and GFX11 a memory attached last level (MALL) cache exists for GPU
  memory. The MALL cache is fully coherent with GPU memory and has no impact on
  system coherence. All agents (GPU and CPU) access GPU memory through the MALL
  cache.

Scalar memory operations are only used to access memory that is proven to not
change during the execution of the kernel dispatch. This includes constant
address space and global address space for program scope ``const`` variables.
Therefore, the kernel machine code does not have to maintain the scalar cache to
ensure it is coherent with the vector caches. The scalar and vector caches are
invalidated between kernel dispatches by CP since constant address space data
may change between kernel dispatch executions. See
:ref:`amdgpu-amdhsa-memory-spaces`.

The one exception is if scalar writes are used to spill SGPR registers. In this
case the AMDGPU backend ensures the memory location used to spill is never
accessed by vector memory operations at the same time.
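The reason spills need only a writeback (and never an invalidate) follows from the write-before-read access pattern. The toy model below is an illustrative Python sketch, not AMDGPU hardware or backend code; the cache model and all names are invented for this example. It shows that a thread reading through its own write-back cache always observes its own spill, while a writeback (playing the role of ``s_dcache_wb``) is what publishes the value for a later consumer of the same backing memory.

```python
# Illustrative model (NOT AMDGPU code): a minimal write-back cache showing
# why write-before-read spill slots never need an invalidate, but do need a
# writeback before the backing memory is reused by someone else.

class WriteBackCache:
    """Write-back, write-allocate cache in front of a shared backing store."""

    def __init__(self, memory):
        self.memory = memory   # shared backing store (dict: addr -> value)
        self.lines = {}        # dirty cache lines (addr -> value)

    def write(self, addr, value):
        # Writes stay dirty in the cache until written back.
        self.lines[addr] = value

    def read(self, addr):
        # Write-before-read always hits the thread's own dirty line, so the
        # (possibly stale) backing store is never observed: no invalidate.
        return self.lines.get(addr, self.memory.get(addr))

    def writeback(self):
        # The s_dcache_wb analog: publish dirty lines to the backing store.
        self.memory.update(self.lines)
        self.lines.clear()


memory = {"spill0": "stale"}
wave0 = WriteBackCache(memory)

wave0.write("spill0", "sgpr-value")           # spill (write happens first)
assert wave0.read("spill0") == "sgpr-value"   # reload sees own write

# Until the writeback, anyone bypassing this cache would see the old value;
# after it, the spilled value is visible in the backing store.
assert memory["spill0"] == "stale"
wave0.writeback()
assert memory["spill0"] == "sgpr-value"
```

In this sketch the roles line up with the text: the reload path never needs an invalidate because the dirty line always wins, and the explicit writeback is required only at the hand-off points described next.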
If scalar writes are used
then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
return since the locations may be used for vector memory instructions by a
future wavefront that uses the same scratch area, or a function call that
creates a frame at the same address, respectively. There is no need for a
``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.

For kernarg backing memory:

* CP invalidates the L0 and L1 caches at the start of each kernel dispatch.
* On dGPU the kernarg backing memory is accessed as MTYPE UC (uncached) to avoid
  needing to invalidate the L2 cache.
* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
  so the L2 cache will be coherent with the CPU and other agents.

Scratch backing memory (which is used for the private address space) is accessed
with MTYPE NC (non-coherent). Since the private address space is only accessed
by a single thread, and is always write-before-read, there is never a need to
invalidate these entries from the L0 or L1 caches.

Wavefronts are executed in native mode with in-order reporting of loads and
sample instructions. In this mode vmcnt reports completion of load, atomic with
return and sample instructions in order, and vscnt reports the completion of
store and atomic without return in order. See the ``MEM_ORDERED`` field in
:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.

Wavefronts can be executed in WGP or CU wavefront execution mode:

* In WGP wavefront execution mode the wavefronts of a work-group are executed
  on the SIMDs of both CUs of the WGP. Therefore, explicit management of the per
  CU L0 caches is required for work-group synchronization.
  Also accesses to L1
  at work-group scope need to be explicitly ordered as the accesses from
  different CUs are not ordered.
* In CU wavefront execution mode the wavefronts of a work-group are executed
  on the SIMDs of a single CU of the WGP. Therefore, all global memory access
  by the work-group access the same L0 which in turn ensures L1 accesses are
  ordered and so do not require explicit management of the caches for
  work-group synchronization.

See ``WGP_MODE`` field in
:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table` and
:ref:`amdgpu-target-features`.

The code sequences used to implement the memory model for GFX10-GFX11 are
defined in table
:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.

  .. table:: AMDHSA Memory Model Code Sequences GFX10-GFX11
     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table

     ============ ============ ============== ========== ================================
     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
                  Ordering     Sync Scope     Address    GFX10-GFX11
                               Space
     ============ ============ ============== ========== ================================
     **Non-Atomic**
     ------------------------------------------------------------------------------------
     load         *none*       *none*         - global   - !volatile & !nontemporal
                                              - generic
                                              - private  1. buffer/global/flat_load
                                              - constant
                                                         - !volatile & nontemporal

                                                           1. buffer/global/flat_load
                                                              slc=1 dlc=1

                                                              - If GFX10, omit dlc=1.

                                                         - volatile

                                                           1. buffer/global/flat_load
                                                              glc=1 dlc=1

                                                           2. s_waitcnt vmcnt(0)

                                                             - Must happen before
                                                               any following volatile
                                                               global/generic
                                                               load/store.
                                                             - Ensures that
                                                               volatile
                                                               operations to
                                                               different
                                                               addresses will not
                                                               be reordered by
                                                               hardware.

     load         *none*       *none*         - local    1. ds_load
     store        *none*       *none*         - global   - !volatile & !nontemporal
                                              - generic
                                              - private  1. buffer/global/flat_store
                                              - constant
                                                         - !volatile & nontemporal

                                                           1. buffer/global/flat_store
                                                              glc=1 slc=1 dlc=1

                                                              - If GFX10, omit dlc=1.

                                                         - volatile

                                                           1. buffer/global/flat_store
                                                              dlc=1

                                                              - If GFX10, omit dlc=1.

                                                           2. s_waitcnt vscnt(0)

                                                             - Must happen before
                                                               any following volatile
                                                               global/generic
                                                               load/store.
                                                             - Ensures that
                                                               volatile
                                                               operations to
                                                               different
                                                               addresses will not
                                                               be reordered by
                                                               hardware.

     store        *none*       *none*         - local    1. ds_store
     **Unordered Atomic**
     ------------------------------------------------------------------------------------
     load atomic  unordered    *any*          *any*      *Same as non-atomic*.
     store atomic unordered    *any*          *any*      *Same as non-atomic*.
     atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
     **Monotonic Atomic**
     ------------------------------------------------------------------------------------
     load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
                               - wavefront    - generic
     load atomic  monotonic    - workgroup    - global   1. buffer/global/flat_load
                                              - generic     glc=1

                                                           - If CU wavefront execution
                                                             mode, omit glc=1.

     load atomic  monotonic    - singlethread - local    1. ds_load
                               - wavefront
                               - workgroup
     load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
                               - system       - generic     glc=1 dlc=1

                                                           - If GFX11, omit dlc=1.

     store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
                               - wavefront    - generic
                               - workgroup
                               - agent
                               - system
     store atomic monotonic    - singlethread - local    1. ds_store
                               - wavefront
                               - workgroup
     atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
                               - wavefront    - generic
                               - workgroup
                               - agent
                               - system
     atomicrmw    monotonic    - singlethread - local    1. ds_atomic
                               - wavefront
                               - workgroup
     **Acquire Atomic**
     ------------------------------------------------------------------------------------
     load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
                               - wavefront    - local
                                              - generic
     load atomic  acquire      - workgroup    - global   1. buffer/global_load glc=1

                                                           - If CU wavefront execution
                                                             mode, omit glc=1.

                                                         2. s_waitcnt vmcnt(0)

                                                           - If CU wavefront execution
                                                             mode, omit.
                                                           - Must happen before
                                                             the following buffer_gl0_inv
                                                             and before any following
                                                             global/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.

                                                         3. buffer_gl0_inv

                                                           - If CU wavefront execution
                                                             mode, omit.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale data.

     load atomic  acquire      - workgroup    - local    1. ds_load
                                                         2. s_waitcnt lgkmcnt(0)

                                                           - If OpenCL, omit.
                                                           - Must happen before
                                                             the following buffer_gl0_inv
                                                             and before any following
                                                             global/generic load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Ensures any
                                                             following global
                                                             data read is no
                                                             older than the local load
                                                             atomic value being
                                                             acquired.

                                                         3. buffer_gl0_inv

                                                           - If CU wavefront execution
                                                             mode, omit.
                                                           - If OpenCL, omit.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale data.

     load atomic  acquire      - workgroup    - generic  1. flat_load glc=1

                                                           - If CU wavefront execution
                                                             mode, omit glc=1.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                           - If CU wavefront execution
                                                             mode, omit vmcnt(0).
                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - Must happen before
                                                             the following
                                                             buffer_gl0_inv and any
                                                             following global/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Ensures any
                                                             following global
                                                             data read is no
                                                             older than a local load
                                                             atomic value being
                                                             acquired.

                                                         3. buffer_gl0_inv

                                                           - If CU wavefront execution
                                                             mode, omit.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale data.

     load atomic  acquire      - agent        - global   1. buffer/global_load
                               - system                     glc=1 dlc=1

                                                           - If GFX11, omit dlc=1.

                                                         2. s_waitcnt vmcnt(0)

                                                           - Must happen before
                                                             following
                                                             buffer_gl*_inv.
                                                           - Ensures the load
                                                             has completed
                                                             before invalidating
                                                             the caches.

                                                         3. buffer_gl0_inv;
                                                            buffer_gl1_inv

                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/atomicrmw.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale global data.

     load atomic  acquire      - agent        - generic  1. flat_load glc=1 dlc=1
                               - system
                                                           - If GFX11, omit dlc=1.

                                                         2. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)

                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - Must happen before
                                                             following
                                                             buffer_gl*_inv.
                                                           - Ensures the flat_load
                                                             has completed
                                                             before invalidating
                                                             the caches.

                                                         3. buffer_gl0_inv;
                                                            buffer_gl1_inv

                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/atomicrmw.
                                                           - Ensures that
                                                             following loads
                                                             will not see stale
                                                             global data.

     atomicrmw    acquire      - singlethread - global   1. buffer/global/ds/flat_atomic
                               - wavefront    - local
                                              - generic
     atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
                                                         2. s_waitcnt vm/vscnt(0)

                                                           - If CU wavefront execution
                                                             mode, omit.
                                                           - Use vmcnt(0) if atomic with
                                                             return and vscnt(0) if
                                                             atomic with no-return.
                                                           - Must happen before
                                                             the following buffer_gl0_inv
                                                             and before any following
                                                             global/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.

                                                         3. buffer_gl0_inv

                                                           - If CU wavefront execution
                                                             mode, omit.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale data.

     atomicrmw    acquire      - workgroup    - local    1. ds_atomic
                                                         2. s_waitcnt lgkmcnt(0)

                                                           - If OpenCL, omit.
                                                           - Must happen before
                                                             the following
                                                             buffer_gl0_inv.
                                                           - Ensures any
                                                             following global
                                                             data read is no
                                                             older than the local
                                                             atomicrmw value
                                                             being acquired.

                                                         3. buffer_gl0_inv

                                                           - If OpenCL, omit.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale data.

     atomicrmw    acquire      - workgroup    - generic  1. flat_atomic
                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vm/vscnt(0)

                                                           - If CU wavefront execution
                                                             mode, omit vm/vscnt(0).
                                                           - If OpenCL, omit lgkmcnt(0).
                                                           - Use vmcnt(0) if atomic with
                                                             return and vscnt(0) if
                                                             atomic with no-return.
                                                           - Must happen before
                                                             the following
                                                             buffer_gl0_inv.
                                                           - Ensures any
                                                             following global
                                                             data read is no
                                                             older than a local
                                                             atomicrmw value
                                                             being acquired.

                                                         3. buffer_gl0_inv

                                                           - If CU wavefront execution
                                                             mode, omit.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale data.

     atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
                               - system                  2. s_waitcnt vm/vscnt(0)

                                                           - Use vmcnt(0) if atomic with
                                                             return and vscnt(0) if
                                                             atomic with no-return.
                                                           - Must happen before
                                                             following
                                                             buffer_gl*_inv.
                                                           - Ensures the
                                                             atomicrmw has
                                                             completed before
                                                             invalidating the
                                                             caches.

                                                         3. buffer_gl0_inv;
                                                            buffer_gl1_inv

                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/atomicrmw.
                                                           - Ensures that
                                                             following loads
                                                             will not see stale
                                                             global data.

     atomicrmw    acquire      - agent        - generic  1. flat_atomic
                               - system                  2. s_waitcnt vm/vscnt(0) &
                                                            lgkmcnt(0)

                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - Use vmcnt(0) if atomic with
                                                             return and vscnt(0) if
                                                             atomic with no-return.
                                                           - Must happen before
                                                             following
                                                             buffer_gl*_inv.
                                                           - Ensures the
                                                             atomicrmw has
                                                             completed before
                                                             invalidating the
                                                             caches.

                                                         3. buffer_gl0_inv;
                                                            buffer_gl1_inv

                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/atomicrmw.
                                                           - Ensures that
                                                             following loads
                                                             will not see stale
                                                             global data.

     fence        acquire      - singlethread *none*     *none*
                               - wavefront
     fence        acquire      - workgroup    *none*     1. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0) & vscnt(0)

                                                           - If CU wavefront execution
                                                             mode, omit vmcnt(0) and
                                                             vscnt(0).
                                                           - If OpenCL and
                                                             address space is
                                                             not generic, omit
                                                             lgkmcnt(0).
                                                           - If OpenCL and
                                                             address space is
                                                             local, omit
                                                             vmcnt(0) and vscnt(0).
                                                           - However, since LLVM
                                                             currently has no
                                                             address space on
                                                             the fence need to
                                                             conservatively
                                                             always generate. If
                                                             fence had an
                                                             address space then
                                                             set to address
                                                             space of OpenCL
                                                             fence flag, or to
                                                             generic if both
                                                             local and global
                                                             flags are
                                                             specified.
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0), s_waitcnt
                                                             vscnt(0) and s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic load
                                                             atomic/
                                                             atomicrmw-with-return-value
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             fence-paired-atomic).
                                                           - s_waitcnt vscnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             atomicrmw-no-return-value
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             fence-paired-atomic).
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic load
                                                             atomic/atomicrmw
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             fence-paired-atomic).
                                                           - Must happen before
                                                             the following
                                                             buffer_gl0_inv.
                                                           - Ensures that the
                                                             fence-paired atomic
                                                             has completed
                                                             before invalidating
                                                             the
                                                             cache. Therefore
                                                             any following
                                                             locations read must
                                                             be no older than
                                                             the value read by
                                                             the
                                                             fence-paired-atomic.

                                                         2. buffer_gl0_inv

                                                           - If CU wavefront execution
                                                             mode, omit.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale data.

     fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
                               - system                     vmcnt(0) & vscnt(0)

                                                           - If OpenCL and
                                                             address space is
                                                             not generic, omit
                                                             lgkmcnt(0).
                                                           - If OpenCL and
                                                             address space is
                                                             local, omit
                                                             vmcnt(0) and vscnt(0).
                                                           - However, since LLVM
                                                             currently has no
                                                             address space on
                                                             the fence need to
                                                             conservatively
                                                             always generate
                                                             (see comment for
                                                             previous fence).
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0), s_waitcnt
                                                             vscnt(0) and s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic load
                                                             atomic/
                                                             atomicrmw-with-return-value
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             fence-paired-atomic).
                                                           - s_waitcnt vscnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             atomicrmw-no-return-value
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             fence-paired-atomic).
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic load
                                                             atomic/atomicrmw
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             fence-paired-atomic).
                                                           - Must happen before
                                                             the following
                                                             buffer_gl*_inv.
                                                           - Ensures that the
                                                             fence-paired atomic
                                                             has completed
                                                             before invalidating
                                                             the
                                                             caches. Therefore
                                                             any following
                                                             locations read must
                                                             be no older than
                                                             the value read by
                                                             the
                                                             fence-paired-atomic.

                                                         2. buffer_gl0_inv;
                                                            buffer_gl1_inv

                                                           - Must happen before any
                                                             following global/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Ensures that
                                                             following loads
                                                             will not see stale
                                                             global data.

     **Release Atomic**
     ------------------------------------------------------------------------------------
     store atomic release      - singlethread - global   1. buffer/global/ds/flat_store
                               - wavefront    - local
                                              - generic
     store atomic release      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
                                              - generic     vmcnt(0) & vscnt(0)

                                                           - If CU wavefront execution
                                                             mode, omit vmcnt(0) and
                                                             vscnt(0).
                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0), s_waitcnt
                                                             vscnt(0) and s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic load/load
                                                             atomic/
                                                             atomicrmw-with-return-value.
                                                           - s_waitcnt vscnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             store/store
                                                             atomic/
                                                             atomicrmw-no-return-value.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
                                                             the following
                                                             store.
                                                           - Ensures that all
                                                             memory operations
                                                             have
                                                             completed before
                                                             performing the
                                                             store that is being
                                                             released.

                                                         2. buffer/global/flat_store
     store atomic release      - workgroup    - local    1. s_waitcnt vmcnt(0) & vscnt(0)

                                                           - If CU wavefront execution
                                                             mode, omit.
                                                           - If OpenCL, omit.
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0) and s_waitcnt
                                                             vscnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic load/load
                                                             atomic/
                                                             atomicrmw-with-return-value.
                                                           - s_waitcnt vscnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             store/store atomic/
                                                             atomicrmw-no-return-value.
                                                           - Must happen before
                                                             the following
                                                             store.
                                                           - Ensures that all
                                                             global memory
                                                             operations have
                                                             completed before
                                                             performing the
                                                             store that is being
                                                             released.

                                                         2. ds_store
     store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
                               - system       - generic     vmcnt(0) & vscnt(0)

                                                           - If OpenCL and
                                                             address space is
                                                             not generic, omit
                                                             lgkmcnt(0).
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0), s_waitcnt vscnt(0)
                                                             and s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             load/load
                                                             atomic/
                                                             atomicrmw-with-return-value.
                                                           - s_waitcnt vscnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             store/store atomic/
                                                             atomicrmw-no-return-value.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
                                                             the following
                                                             store.
                                                           - Ensures that all
                                                             memory operations
                                                             have
                                                             completed before
                                                             performing the
                                                             store that is being
                                                             released.

                                                         2. buffer/global/flat_store
     atomicrmw    release      - singlethread - global   1. buffer/global/ds/flat_atomic
                               - wavefront    - local
                                              - generic
     atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
                                              - generic     vmcnt(0) & vscnt(0)

                                                           - If CU wavefront execution
                                                             mode, omit vmcnt(0) and
                                                             vscnt(0).
                                                           - If OpenCL, omit lgkmcnt(0).
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0), s_waitcnt
                                                             vscnt(0) and s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic load/load
                                                             atomic/
                                                             atomicrmw-with-return-value.
                                                           - s_waitcnt vscnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             store/store
                                                             atomic/
                                                             atomicrmw-no-return-value.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
                                                             the following
                                                             atomicrmw.
                                                           - Ensures that all
                                                             memory operations
                                                             have
                                                             completed before
                                                             performing the
                                                             atomicrmw that is
                                                             being released.

                                                         2. buffer/global/flat_atomic
     atomicrmw    release      - workgroup    - local    1. s_waitcnt vmcnt(0) & vscnt(0)

                                                           - If CU wavefront execution
                                                             mode, omit.
                                                           - If OpenCL, omit.
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0) and s_waitcnt
                                                             vscnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic load/load
                                                             atomic/
                                                             atomicrmw-with-return-value.
                                                           - s_waitcnt vscnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             store/store atomic/
                                                             atomicrmw-no-return-value.
                                                           - Must happen before
                                                             the following
                                                             store.
                                                           - Ensures that all
                                                             global memory
                                                             operations have
                                                             completed before
                                                             performing the
                                                             store that is being
                                                             released.

                                                         2. ds_atomic
     atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
                               - system       - generic     vmcnt(0) & vscnt(0)

                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0), s_waitcnt
                                                             vscnt(0) and s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             load/load atomic/
                                                             atomicrmw-with-return-value.
                                                           - s_waitcnt vscnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             store/store atomic/
                                                             atomicrmw-no-return-value.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
                                                             the following
                                                             atomicrmw.
                                                           - Ensures that all
                                                             memory operations
                                                             to global and local
                                                             have completed
                                                             before performing
                                                             the atomicrmw that
                                                             is being released.

                                                         2. buffer/global/flat_atomic
     fence        release      - singlethread *none*     *none*
                               - wavefront
     fence        release      - workgroup    *none*     1. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0) & vscnt(0)

                                                           - If CU wavefront execution
                                                             mode, omit vmcnt(0) and
                                                             vscnt(0).
                                                           - If OpenCL and
                                                             address space is
                                                             not generic, omit
                                                             lgkmcnt(0).
                                                           - If OpenCL and
                                                             address space is
                                                             local, omit
                                                             vmcnt(0) and vscnt(0).
                                                           - However, since LLVM
                                                             currently has no
                                                             address space on
                                                             the fence need to
                                                             conservatively
                                                             always generate. If
                                                             fence had an
                                                             address space then
                                                             set to address
                                                             space of OpenCL
                                                             fence flag, or to
                                                             generic if both
                                                             local and global
                                                             flags are
                                                             specified.
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0), s_waitcnt
                                                             vscnt(0) and s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             load/load
                                                             atomic/
                                                             atomicrmw-with-return-value.
                                                           - s_waitcnt vscnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             store/store atomic/
                                                             atomicrmw-no-return-value.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store atomic/
                                                             atomicrmw.
                                                           - Must happen before
                                                             any following store
                                                             atomic/atomicrmw
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             fence-paired-atomic).
                                                           - Ensures that all
                                                             memory operations
                                                             have
                                                             completed before
                                                             performing the
                                                             following
                                                             fence-paired-atomic.

     fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
                               - system                     vmcnt(0) & vscnt(0)

                                                           - If OpenCL and
                                                             address space is
                                                             not generic, omit
                                                             lgkmcnt(0).
                                                           - If OpenCL and
                                                             address space is
                                                             local, omit
                                                             vmcnt(0) and vscnt(0).
                                                           - However, since LLVM
                                                             currently has no
                                                             address space on
                                                             the fence need to
                                                             conservatively
                                                             always generate. If
                                                             fence had an
                                                             address space then
                                                             set to address
                                                             space of OpenCL
                                                             fence flag, or to
                                                             generic if both
                                                             local and global
                                                             flags are
                                                             specified.
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0), s_waitcnt
                                                             vscnt(0) and s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             load/load atomic/
                                                             atomicrmw-with-return-value.
                                                           - s_waitcnt vscnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             store/store atomic/
                                                             atomicrmw-no-return-value.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
                                                             any following store
                                                             atomic/atomicrmw
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             fence-paired-atomic).
                                                           - Ensures that all
                                                             memory operations
                                                             have
                                                             completed before
                                                             performing the
                                                             following
                                                             fence-paired-atomic.

     **Acquire-Release Atomic**
     ------------------------------------------------------------------------------------
     atomicrmw    acq_rel      - singlethread - global   1. buffer/global/ds/flat_atomic
                               - wavefront    - local
                                              - generic
     atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0) & vscnt(0)

                                                           - If CU wavefront execution
                                                             mode, omit vmcnt(0) and
                                                             vscnt(0).
                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - Must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0), s_waitcnt
                                                             vscnt(0), and s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic load/load
                                                             atomic/
                                                             atomicrmw-with-return-value.
                                                           - s_waitcnt vscnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             store/store
                                                             atomic/
                                                             atomicrmw-no-return-value.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
                                                             the following
                                                             atomicrmw.
                                                           - Ensures that all
                                                             memory operations
                                                             have
                                                             completed before
                                                             performing the
                                                             atomicrmw that is
                                                             being released.

                                                         2. buffer/global_atomic
                                                         3. s_waitcnt vm/vscnt(0)

                                                           - If CU wavefront execution
                                                             mode, omit.
                                                           - Use vmcnt(0) if atomic with
                                                             return and vscnt(0) if
                                                             atomic with no-return.
                                                           - Must happen before
                                                             the following
                                                             buffer_gl0_inv.
                                                           - Ensures any
                                                             following global
                                                             data read is no
                                                             older than the
                                                             atomicrmw value
                                                             being acquired.

                                                         4. buffer_gl0_inv

                                                           - If CU wavefront execution
                                                             mode, omit.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale data.

     atomicrmw    acq_rel      - workgroup    - local    1. s_waitcnt vmcnt(0) & vscnt(0)

                                                           - If CU wavefront execution
                                                             mode, omit.
                                                           - If OpenCL, omit.
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0) and s_waitcnt
                                                             vscnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic load/load
                                                             atomic/
                                                             atomicrmw-with-return-value.
                                                           - s_waitcnt vscnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             store/store atomic/
                                                             atomicrmw-no-return-value.
                                                           - Must happen before
                                                             the following
                                                             store.
                                                           - Ensures that all
                                                             global memory
                                                             operations have
                                                             completed before
                                                             performing the
                                                             store that is being
                                                             released.

                                                         2. ds_atomic
                                                         3. s_waitcnt lgkmcnt(0)

                                                           - If OpenCL, omit.
                                                           - Must happen before
                                                             the following
                                                             buffer_gl0_inv.
                                                           - Ensures any
                                                             following global
                                                             data read is no
                                                             older than the local load
                                                             atomic value being
                                                             acquired.

                                                         4. buffer_gl0_inv

                                                           - If CU wavefront execution
                                                             mode, omit.
                                                           - If OpenCL, omit.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale data.

     atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0) & vscnt(0)

                                                           - If CU wavefront execution
                                                             mode, omit vmcnt(0) and
                                                             vscnt(0).
                                                           - If OpenCL, omit lgkmcnt(0).
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0), s_waitcnt
                                                             vscnt(0) and s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic load/load
                                                             atomic/
                                                             atomicrmw-with-return-value.
                                                           - s_waitcnt vscnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             store/store
                                                             atomic/
                                                             atomicrmw-no-return-value.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
                                                             the following
                                                             atomicrmw.
                                                           - Ensures that all
                                                             memory operations
                                                             have
                                                             completed before
                                                             performing the
                                                             atomicrmw that is
                                                             being released.

                                                         2. flat_atomic
                                                         3. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0) & vscnt(0)

                                                           - If CU wavefront execution
                                                             mode, omit vmcnt(0) and
                                                             vscnt(0).
                                                           - If OpenCL, omit lgkmcnt(0).
                                                           - Must happen before
                                                             the following
                                                             buffer_gl0_inv.
                                                           - Ensures any
                                                             following global
                                                             data read is no
                                                             older than the load
                                                             atomic value being
                                                             acquired.

                                                         4. buffer_gl0_inv

                                                           - If CU wavefront execution
                                                             mode, omit.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale data.

     atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &
                               - system                     vmcnt(0) & vscnt(0)

                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0), s_waitcnt
                                                             vscnt(0) and s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             load/load atomic/
                                                             atomicrmw-with-return-value.
                                                           - s_waitcnt vscnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             store/store atomic/
                                                             atomicrmw-no-return-value.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
                                                             the following
                                                             atomicrmw.
                                                           - Ensures that all
                                                             memory operations
                                                             to global have
                                                             completed before
                                                             performing the
                                                             atomicrmw that is
                                                             being released.

                                                         2. buffer/global_atomic
                                                         3. s_waitcnt vm/vscnt(0)

                                                           - Use vmcnt(0) if atomic with
                                                             return and vscnt(0) if
                                                             atomic with no-return.
                                                           - Must happen before
                                                             following
                                                             buffer_gl*_inv.
                                                           - Ensures the
                                                             atomicrmw has
                                                             completed before
                                                             invalidating the
                                                             caches.

                                                         4. buffer_gl0_inv;
                                                            buffer_gl1_inv

                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/atomicrmw.
                                                           - Ensures that
                                                             following loads
                                                             will not see stale
                                                             global data.

     atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &
                               - system                     vmcnt(0) & vscnt(0)

                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0), s_waitcnt
                                                             vscnt(0), and s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
12643 - s_waitcnt vmcnt(0) 12644 must happen after 12645 any preceding 12646 global/generic 12647 load/load atomic 12648 atomicrmw-with-return-value. 12649 - s_waitcnt vscnt(0) 12650 must happen after 12651 any preceding 12652 global/generic 12653 store/store atomic/ 12654 atomicrmw-no-return-value. 12655 - s_waitcnt lgkmcnt(0) 12656 must happen after 12657 any preceding 12658 local/generic 12659 load/store/load 12660 atomic/store 12661 atomic/atomicrmw. 12662 - Must happen before 12663 the following 12664 atomicrmw. 12665 - Ensures that all 12666 memory operations 12667 have 12668 completed before 12669 performing the 12670 atomicrmw that is 12671 being released. 12672 12673 2. flat_atomic 12674 3. s_waitcnt vm/vscnt(0) & 12675 lgkmcnt(0) 12676 12677 - If OpenCL, omit 12678 lgkmcnt(0). 12679 - Use vmcnt(0) if atomic with 12680 return and vscnt(0) if 12681 atomic with no-return. 12682 - Must happen before 12683 following 12684 buffer_gl*_inv. 12685 - Ensures the 12686 atomicrmw has 12687 completed before 12688 invalidating the 12689 caches. 12690 12691 4. buffer_gl0_inv; 12692 buffer_gl1_inv 12693 12694 - Must happen before 12695 any following 12696 global/generic 12697 load/load 12698 atomic/atomicrmw. 12699 - Ensures that 12700 following loads 12701 will not see stale 12702 global data. 12703 12704 fence acq_rel - singlethread *none* *none* 12705 - wavefront 12706 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0) & 12707 vmcnt(0) & vscnt(0) 12708 12709 - If CU wavefront execution 12710 mode, omit vmcnt(0) and 12711 vscnt(0). 12712 - If OpenCL and 12713 address space is 12714 not generic, omit 12715 lgkmcnt(0). 12716 - If OpenCL and 12717 address space is 12718 local, omit 12719 vmcnt(0) and vscnt(0). 12720 - However, 12721 since LLVM 12722 currently has no 12723 address space on 12724 the fence need to 12725 conservatively 12726 always generate 12727 (see comment for 12728 previous fence). 
12729 - Could be split into 12730 separate s_waitcnt 12731 vmcnt(0), s_waitcnt 12732 vscnt(0) and s_waitcnt 12733 lgkmcnt(0) to allow 12734 them to be 12735 independently moved 12736 according to the 12737 following rules. 12738 - s_waitcnt vmcnt(0) 12739 must happen after 12740 any preceding 12741 global/generic 12742 load/load 12743 atomic/ 12744 atomicrmw-with-return-value. 12745 - s_waitcnt vscnt(0) 12746 must happen after 12747 any preceding 12748 global/generic 12749 store/store atomic/ 12750 atomicrmw-no-return-value. 12751 - s_waitcnt lgkmcnt(0) 12752 must happen after 12753 any preceding 12754 local/generic 12755 load/store/load 12756 atomic/store atomic/ 12757 atomicrmw. 12758 - Must happen before 12759 any following 12760 global/generic 12761 load/load 12762 atomic/store/store 12763 atomic/atomicrmw. 12764 - Ensures that all 12765 memory operations 12766 have 12767 completed before 12768 performing any 12769 following global 12770 memory operations. 12771 - Ensures that the 12772 preceding 12773 local/generic load 12774 atomic/atomicrmw 12775 with an equal or 12776 wider sync scope 12777 and memory ordering 12778 stronger than 12779 unordered (this is 12780 termed the 12781 acquire-fence-paired-atomic) 12782 has completed 12783 before following 12784 global memory 12785 operations. This 12786 satisfies the 12787 requirements of 12788 acquire. 12789 - Ensures that all 12790 previous memory 12791 operations have 12792 completed before a 12793 following 12794 local/generic store 12795 atomic/atomicrmw 12796 with an equal or 12797 wider sync scope 12798 and memory ordering 12799 stronger than 12800 unordered (this is 12801 termed the 12802 release-fence-paired-atomic). 12803 This satisfies the 12804 requirements of 12805 release. 12806 - Must happen before 12807 the following 12808 buffer_gl0_inv. 12809 - Ensures that the 12810 acquire-fence-paired 12811 atomic has completed 12812 before invalidating 12813 the 12814 cache. 
Therefore 12815 any following 12816 locations read must 12817 be no older than 12818 the value read by 12819 the 12820 acquire-fence-paired-atomic. 12821 12822 3. buffer_gl0_inv 12823 12824 - If CU wavefront execution 12825 mode, omit. 12826 - Ensures that 12827 following 12828 loads will not see 12829 stale data. 12830 12831 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) & 12832 - system vmcnt(0) & vscnt(0) 12833 12834 - If OpenCL and 12835 address space is 12836 not generic, omit 12837 lgkmcnt(0). 12838 - If OpenCL and 12839 address space is 12840 local, omit 12841 vmcnt(0) and vscnt(0). 12842 - However, since LLVM 12843 currently has no 12844 address space on 12845 the fence need to 12846 conservatively 12847 always generate 12848 (see comment for 12849 previous fence). 12850 - Could be split into 12851 separate s_waitcnt 12852 vmcnt(0), s_waitcnt 12853 vscnt(0) and s_waitcnt 12854 lgkmcnt(0) to allow 12855 them to be 12856 independently moved 12857 according to the 12858 following rules. 12859 - s_waitcnt vmcnt(0) 12860 must happen after 12861 any preceding 12862 global/generic 12863 load/load 12864 atomic/ 12865 atomicrmw-with-return-value. 12866 - s_waitcnt vscnt(0) 12867 must happen after 12868 any preceding 12869 global/generic 12870 store/store atomic/ 12871 atomicrmw-no-return-value. 12872 - s_waitcnt lgkmcnt(0) 12873 must happen after 12874 any preceding 12875 local/generic 12876 load/store/load 12877 atomic/store 12878 atomic/atomicrmw. 12879 - Must happen before 12880 the following 12881 buffer_gl*_inv. 12882 - Ensures that the 12883 preceding 12884 global/local/generic 12885 load 12886 atomic/atomicrmw 12887 with an equal or 12888 wider sync scope 12889 and memory ordering 12890 stronger than 12891 unordered (this is 12892 termed the 12893 acquire-fence-paired-atomic) 12894 has completed 12895 before invalidating 12896 the caches. This 12897 satisfies the 12898 requirements of 12899 acquire. 
12900 - Ensures that all 12901 previous memory 12902 operations have 12903 completed before a 12904 following 12905 global/local/generic 12906 store 12907 atomic/atomicrmw 12908 with an equal or 12909 wider sync scope 12910 and memory ordering 12911 stronger than 12912 unordered (this is 12913 termed the 12914 release-fence-paired-atomic). 12915 This satisfies the 12916 requirements of 12917 release. 12918 12919 2. buffer_gl0_inv; 12920 buffer_gl1_inv 12921 12922 - Must happen before 12923 any following 12924 global/generic 12925 load/load 12926 atomic/store/store 12927 atomic/atomicrmw. 12928 - Ensures that 12929 following loads 12930 will not see stale 12931 global data. This 12932 satisfies the 12933 requirements of 12934 acquire. 12935 12936 **Sequential Consistent Atomic** 12937 ------------------------------------------------------------------------------------ 12938 load atomic seq_cst - singlethread - global *Same as corresponding 12939 - wavefront - local load atomic acquire, 12940 - generic except must generate 12941 all instructions even 12942 for OpenCL.* 12943 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0) & 12944 - generic vmcnt(0) & vscnt(0) 12945 12946 - If CU wavefront execution 12947 mode, omit vmcnt(0) and 12948 vscnt(0). 12949 - Could be split into 12950 separate s_waitcnt 12951 vmcnt(0), s_waitcnt 12952 vscnt(0), and s_waitcnt 12953 lgkmcnt(0) to allow 12954 them to be 12955 independently moved 12956 according to the 12957 following rules. 12958 - s_waitcnt lgkmcnt(0) must 12959 happen after 12960 preceding 12961 local/generic load 12962 atomic/store 12963 atomic/atomicrmw 12964 with memory 12965 ordering of seq_cst 12966 and with equal or 12967 wider sync scope. 12968 (Note that seq_cst 12969 fences have their 12970 own s_waitcnt 12971 lgkmcnt(0) and so do 12972 not need to be 12973 considered.) 
12974 - s_waitcnt vmcnt(0) 12975 must happen after 12976 preceding 12977 global/generic load 12978 atomic/ 12979 atomicrmw-with-return-value 12980 with memory 12981 ordering of seq_cst 12982 and with equal or 12983 wider sync scope. 12984 (Note that seq_cst 12985 fences have their 12986 own s_waitcnt 12987 vmcnt(0) and so do 12988 not need to be 12989 considered.) 12990 - s_waitcnt vscnt(0) 12991 Must happen after 12992 preceding 12993 global/generic store 12994 atomic/ 12995 atomicrmw-no-return-value 12996 with memory 12997 ordering of seq_cst 12998 and with equal or 12999 wider sync scope. 13000 (Note that seq_cst 13001 fences have their 13002 own s_waitcnt 13003 vscnt(0) and so do 13004 not need to be 13005 considered.) 13006 - Ensures any 13007 preceding 13008 sequential 13009 consistent global/local 13010 memory instructions 13011 have completed 13012 before executing 13013 this sequentially 13014 consistent 13015 instruction. This 13016 prevents reordering 13017 a seq_cst store 13018 followed by a 13019 seq_cst load. (Note 13020 that seq_cst is 13021 stronger than 13022 acquire/release as 13023 the reordering of 13024 load acquire 13025 followed by a store 13026 release is 13027 prevented by the 13028 s_waitcnt of 13029 the release, but 13030 there is nothing 13031 preventing a store 13032 release followed by 13033 load acquire from 13034 completing out of 13035 order. The s_waitcnt 13036 could be placed after 13037 seq_store or before 13038 the seq_load. We 13039 choose the load to 13040 make the s_waitcnt be 13041 as late as possible 13042 so that the store 13043 may have already 13044 completed.) 13045 13046 2. *Following 13047 instructions same as 13048 corresponding load 13049 atomic acquire, 13050 except must generate 13051 all instructions even 13052 for OpenCL.* 13053 load atomic seq_cst - workgroup - local 13054 13055 1. s_waitcnt vmcnt(0) & vscnt(0) 13056 13057 - If CU wavefront execution 13058 mode, omit. 
13059 - Could be split into 13060 separate s_waitcnt 13061 vmcnt(0) and s_waitcnt 13062 vscnt(0) to allow 13063 them to be 13064 independently moved 13065 according to the 13066 following rules. 13067 - s_waitcnt vmcnt(0) 13068 Must happen after 13069 preceding 13070 global/generic load 13071 atomic/ 13072 atomicrmw-with-return-value 13073 with memory 13074 ordering of seq_cst 13075 and with equal or 13076 wider sync scope. 13077 (Note that seq_cst 13078 fences have their 13079 own s_waitcnt 13080 vmcnt(0) and so do 13081 not need to be 13082 considered.) 13083 - s_waitcnt vscnt(0) 13084 Must happen after 13085 preceding 13086 global/generic store 13087 atomic/ 13088 atomicrmw-no-return-value 13089 with memory 13090 ordering of seq_cst 13091 and with equal or 13092 wider sync scope. 13093 (Note that seq_cst 13094 fences have their 13095 own s_waitcnt 13096 vscnt(0) and so do 13097 not need to be 13098 considered.) 13099 - Ensures any 13100 preceding 13101 sequential 13102 consistent global 13103 memory instructions 13104 have completed 13105 before executing 13106 this sequentially 13107 consistent 13108 instruction. This 13109 prevents reordering 13110 a seq_cst store 13111 followed by a 13112 seq_cst load. (Note 13113 that seq_cst is 13114 stronger than 13115 acquire/release as 13116 the reordering of 13117 load acquire 13118 followed by a store 13119 release is 13120 prevented by the 13121 s_waitcnt of 13122 the release, but 13123 there is nothing 13124 preventing a store 13125 release followed by 13126 load acquire from 13127 completing out of 13128 order. The s_waitcnt 13129 could be placed after 13130 seq_store or before 13131 the seq_load. We 13132 choose the load to 13133 make the s_waitcnt be 13134 as late as possible 13135 so that the store 13136 may have already 13137 completed.) 13138 13139 2. 
*Following 13140 instructions same as 13141 corresponding load 13142 atomic acquire, 13143 except must generate 13144 all instructions even 13145 for OpenCL.* 13146 13147 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) & 13148 - system - generic vmcnt(0) & vscnt(0) 13149 13150 - Could be split into 13151 separate s_waitcnt 13152 vmcnt(0), s_waitcnt 13153 vscnt(0) and s_waitcnt 13154 lgkmcnt(0) to allow 13155 them to be 13156 independently moved 13157 according to the 13158 following rules. 13159 - s_waitcnt lgkmcnt(0) 13160 must happen after 13161 preceding 13162 local load 13163 atomic/store 13164 atomic/atomicrmw 13165 with memory 13166 ordering of seq_cst 13167 and with equal or 13168 wider sync scope. 13169 (Note that seq_cst 13170 fences have their 13171 own s_waitcnt 13172 lgkmcnt(0) and so do 13173 not need to be 13174 considered.) 13175 - s_waitcnt vmcnt(0) 13176 must happen after 13177 preceding 13178 global/generic load 13179 atomic/ 13180 atomicrmw-with-return-value 13181 with memory 13182 ordering of seq_cst 13183 and with equal or 13184 wider sync scope. 13185 (Note that seq_cst 13186 fences have their 13187 own s_waitcnt 13188 vmcnt(0) and so do 13189 not need to be 13190 considered.) 13191 - s_waitcnt vscnt(0) 13192 Must happen after 13193 preceding 13194 global/generic store 13195 atomic/ 13196 atomicrmw-no-return-value 13197 with memory 13198 ordering of seq_cst 13199 and with equal or 13200 wider sync scope. 13201 (Note that seq_cst 13202 fences have their 13203 own s_waitcnt 13204 vscnt(0) and so do 13205 not need to be 13206 considered.) 13207 - Ensures any 13208 preceding 13209 sequential 13210 consistent global 13211 memory instructions 13212 have completed 13213 before executing 13214 this sequentially 13215 consistent 13216 instruction. This 13217 prevents reordering 13218 a seq_cst store 13219 followed by a 13220 seq_cst load. 
(Note 13221 that seq_cst is 13222 stronger than 13223 acquire/release as 13224 the reordering of 13225 load acquire 13226 followed by a store 13227 release is 13228 prevented by the 13229 s_waitcnt of 13230 the release, but 13231 there is nothing 13232 preventing a store 13233 release followed by 13234 load acquire from 13235 completing out of 13236 order. The s_waitcnt 13237 could be placed after 13238 seq_store or before 13239 the seq_load. We 13240 choose the load to 13241 make the s_waitcnt be 13242 as late as possible 13243 so that the store 13244 may have already 13245 completed.) 13246 13247 2. *Following 13248 instructions same as 13249 corresponding load 13250 atomic acquire, 13251 except must generate 13252 all instructions even 13253 for OpenCL.* 13254 store atomic seq_cst - singlethread - global *Same as corresponding 13255 - wavefront - local store atomic release, 13256 - workgroup - generic except must generate 13257 - agent all instructions even 13258 - system for OpenCL.* 13259 atomicrmw seq_cst - singlethread - global *Same as corresponding 13260 - wavefront - local atomicrmw acq_rel, 13261 - workgroup - generic except must generate 13262 - agent all instructions even 13263 - system for OpenCL.* 13264 fence seq_cst - singlethread *none* *Same as corresponding 13265 - wavefront fence acq_rel, 13266 - workgroup except must generate 13267 - agent all instructions even 13268 - system for OpenCL.* 13269 ============ ============ ============== ========== ================================ 13270 13271.. _amdgpu-amdhsa-trap-handler-abi: 13272 13273Trap Handler ABI 13274~~~~~~~~~~~~~~~~ 13275 13276For code objects generated by the AMDGPU backend for HSA [HSA]_ compatible 13277runtimes (see :ref:`amdgpu-os`), the runtime installs a trap handler that 13278supports the ``s_trap`` instruction. 
For usage see:

- :ref:`amdgpu-trap-handler-for-amdhsa-os-v2-table`
- :ref:`amdgpu-trap-handler-for-amdhsa-os-v3-table`
- :ref:`amdgpu-trap-handler-for-amdhsa-os-v4-onwards-table`

  .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V2
     :name: amdgpu-trap-handler-for-amdhsa-os-v2-table

     =================== =============== =============== =======================================
     Usage               Code Sequence   Trap Handler    Description
                                         Inputs
     =================== =============== =============== =======================================
     reserved            ``s_trap 0x00``                 Reserved by hardware.
     ``debugtrap(arg)``  ``s_trap 0x01`` ``SGPR0-1``:    Reserved for Finalizer HSA ``debugtrap``
                                         ``queue_ptr``   intrinsic (not implemented).
                                         ``VGPR0``:
                                         ``arg``
     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes the wave to be halted with the
                                         ``queue_ptr``   PC at the trap instruction. The
                                                         associated queue is signalled to put it
                                                         into the error state. When the queue is
                                                         put in the error state, the waves
                                                         executing dispatches on the queue will
                                                         be terminated.
     ``llvm.debugtrap``  ``s_trap 0x03`` *none*          - If the debugger is not enabled then
                                                           behaves as a no-operation. The trap
                                                           handler is entered and immediately
                                                           returns to continue execution of the
                                                           wavefront.
                                                         - If the debugger is enabled, causes
                                                           the debug trap to be reported by the
                                                           debugger and the wavefront is put in
                                                           the halt state with the PC at the
                                                           instruction. The debugger must
                                                           increment the PC and resume the wave.
     reserved            ``s_trap 0x04``                 Reserved.
     reserved            ``s_trap 0x05``                 Reserved.
     reserved            ``s_trap 0x06``                 Reserved.
     reserved            ``s_trap 0x07``                 Reserved.
     reserved            ``s_trap 0x08``                 Reserved.
     reserved            ``s_trap 0xfe``                 Reserved.
     reserved            ``s_trap 0xff``                 Reserved.
     =================== =============== =============== =======================================

..

  .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V3
     :name: amdgpu-trap-handler-for-amdhsa-os-v3-table

     =================== =============== =============== =======================================
     Usage               Code Sequence   Trap Handler    Description
                                         Inputs
     =================== =============== =============== =======================================
     reserved            ``s_trap 0x00``                 Reserved by hardware.
     debugger breakpoint ``s_trap 0x01`` *none*          Reserved for the debugger to use for
                                                         breakpoints. Causes the wave to be
                                                         halted with the PC at the trap
                                                         instruction. The debugger is
                                                         responsible for resuming the wave,
                                                         including the instruction that the
                                                         breakpoint overwrote.
     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes the wave to be halted with the
                                         ``queue_ptr``   PC at the trap instruction. The
                                                         associated queue is signalled to put it
                                                         into the error state. When the queue is
                                                         put in the error state, the waves
                                                         executing dispatches on the queue will
                                                         be terminated.
     ``llvm.debugtrap``  ``s_trap 0x03`` *none*          - If the debugger is not enabled then
                                                           behaves as a no-operation. The trap
                                                           handler is entered and immediately
                                                           returns to continue execution of the
                                                           wavefront.
                                                         - If the debugger is enabled, causes
                                                           the debug trap to be reported by the
                                                           debugger and the wavefront is put in
                                                           the halt state with the PC at the
                                                           instruction. The debugger must
                                                           increment the PC and resume the wave.
     reserved            ``s_trap 0x04``                 Reserved.
     reserved            ``s_trap 0x05``                 Reserved.
     reserved            ``s_trap 0x06``                 Reserved.
     reserved            ``s_trap 0x07``                 Reserved.
     reserved            ``s_trap 0x08``                 Reserved.
     reserved            ``s_trap 0xfe``                 Reserved.
     reserved            ``s_trap 0xff``                 Reserved.
     =================== =============== =============== =======================================

..

  .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V4 and Above
     :name: amdgpu-trap-handler-for-amdhsa-os-v4-onwards-table

     =================== =============== ================ ================= =======================================
     Usage               Code Sequence   GFX6-GFX8 Inputs GFX9-GFX11 Inputs Description
     =================== =============== ================ ================= =======================================
     reserved            ``s_trap 0x00``                                    Reserved by hardware.
     debugger breakpoint ``s_trap 0x01`` *none*           *none*            Reserved for the debugger to use for
                                                                            breakpoints. Causes the wave to be
                                                                            halted with the PC at the trap
                                                                            instruction. The debugger is
                                                                            responsible for resuming the wave,
                                                                            including the instruction that the
                                                                            breakpoint overwrote.
     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:     *none*            Causes the wave to be halted with the
                                         ``queue_ptr``                      PC at the trap instruction. The
                                                                            associated queue is signalled to put it
                                                                            into the error state. When the queue is
                                                                            put in the error state, the waves
                                                                            executing dispatches on the queue will
                                                                            be terminated.
     ``llvm.debugtrap``  ``s_trap 0x03`` *none*           *none*            - If the debugger is not enabled then
                                                                              behaves as a no-operation. The trap
                                                                              handler is entered and immediately
                                                                              returns to continue execution of the
                                                                              wavefront.
                                                                            - If the debugger is enabled, causes
                                                                              the debug trap to be reported by the
                                                                              debugger and the wavefront is put in
                                                                              the halt state with the PC at the
                                                                              instruction. The debugger must
                                                                              increment the PC and resume the wave.
     reserved            ``s_trap 0x04``                                    Reserved.
     reserved            ``s_trap 0x05``                                    Reserved.
     reserved            ``s_trap 0x06``                                    Reserved.
     reserved            ``s_trap 0x07``                                    Reserved.
     reserved            ``s_trap 0x08``                                    Reserved.
     reserved            ``s_trap 0xfe``                                    Reserved.
     reserved            ``s_trap 0xff``                                    Reserved.
     =================== =============== ================ ================= =======================================

.. _amdgpu-amdhsa-function-call-convention:

Call Convention
~~~~~~~~~~~~~~~

.. note::

  This section is currently incomplete and has inaccuracies. It is WIP that will
  be updated as information is determined.

See :ref:`amdgpu-dwarf-address-space-identifier` for information on swizzled
addresses. Unswizzled addresses are normal linear addresses.

.. _amdgpu-amdhsa-function-call-convention-kernel-functions:

Kernel Functions
++++++++++++++++

This section describes the call convention ABI for the outer kernel function.

See :ref:`amdgpu-amdhsa-initial-kernel-execution-state` for the kernel call
convention.

The following is not part of the AMDGPU kernel calling convention but describes
how the AMDGPU implements function calls:

1. Clang decides the kernarg layout to match the *HSA Programmer's Language
   Reference* [HSA]_.

   - All structs are passed directly.
   - Lambda values are passed *TBA*.

   .. TODO::

      - Does this really follow HSA rules? Or are structs >16 bytes passed
        by-value struct?
      - What is the ABI for lambda values?

2. The kernel performs certain setup in its prolog, as described in
   :ref:`amdgpu-amdhsa-kernel-prolog`.

.. _amdgpu-amdhsa-function-call-convention-non-kernel-functions:

Non-Kernel Functions
++++++++++++++++++++

This section describes the call convention ABI for functions other than the
outer kernel function.

If a kernel has function calls then scratch is always allocated and used for
the call stack, which grows from low address to high address using the swizzled
scratch address space.

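The swizzled scratch layout used for this call stack (dword element size, a
stride of wavefront-size elements, as described for the scratch V# below) can
be illustrated with a small model. This is a sketch only — the helper name and
the byte-level arithmetic are illustrative assumptions, not part of the ABI:

```python
# Hypothetical model of the swizzled scratch mapping: each "row" of the
# wavefront scratch backing memory holds one dword for every lane, so
# consecutive dwords of one lane's stack are wavefront-size dwords apart.
DWORD = 4

def backing_offset(lane, unswizzled_offset, wavefront_size=64):
    """Byte offset into the wavefront scratch backing memory for a lane's
    private (unswizzled) scratch byte offset."""
    element = unswizzled_offset // DWORD        # which dword of the lane's stack
    byte_in_element = unswizzled_offset % DWORD
    return (element * wavefront_size + lane) * DWORD + byte_in_element

# Lane 0's dwords 0 and 1 are wavefront_size dwords apart in backing memory.
assert backing_offset(0, 0) == 0
assert backing_offset(0, 4) == 64 * DWORD
# Adjacent lanes' copies of the same dword are contiguous.
assert backing_offset(1, 0) == DWORD
```

This is also why the swizzled SP is the unswizzled SP divided by the wavefront
size: advancing one dword in a lane's swizzled view advances a full
wavefront-size row in the unswizzled backing memory.
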
On entry to a function:

1. SGPR0-3 contain a V# with the following properties (see
   :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`):

   * Base address pointing to the beginning of the wavefront scratch backing
     memory.
   * Swizzled with dword element size and stride of wavefront size elements.

2. The FLAT_SCRATCH register pair is set up. See
   :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
3. GFX6-GFX8: M0 register set to the size of LDS in bytes. See
   :ref:`amdgpu-amdhsa-kernel-prolog-m0`.
4. The EXEC register is set to the lanes active on entry to the function.
5. MODE register: *TBD*
6. VGPR0-31 and SGPR4-29 are used to pass function input arguments as described
   below.
7. SGPR30-31 return address (RA). The code address that the function must
   return to when it completes. The value is undefined if the function is *no
   return*.
8. SGPR32 is used for the stack pointer (SP). It is an unswizzled scratch
   offset relative to the beginning of the wavefront scratch backing memory.

   The unswizzled SP can be used with buffer instructions as an unswizzled SGPR
   offset with the scratch V# in SGPR0-3 to access the stack in a swizzled
   manner.

   The unswizzled SP value can be converted into the swizzled SP value by:

   | swizzled SP = unswizzled SP / wavefront size

   This may be used to obtain the private address space address of stack
   objects and to convert this address to a flat address by adding the flat
   scratch aperture base address.

   The swizzled SP value is always 4-byte aligned for the ``r600``
   architecture and 16-byte aligned for the ``amdgcn`` architecture.

   .. note::

     The ``amdgcn`` value is selected to avoid dynamic stack alignment for the
     OpenCL language, which has the largest base type defined as 16 bytes.

   On entry, the swizzled SP value is the address of the first function
   argument passed on the stack. Other stack passed arguments are positive
   offsets from the entry swizzled SP value.

   The function may use positive offsets beyond the last stack passed argument
   for stack allocated local variables and register spill slots. If necessary,
   the function may align these to greater alignment than 16 bytes. After these
   the function may dynamically allocate space for such things as runtime sized
   ``alloca`` local allocations.

   If the function calls another function, it will place any stack allocated
   arguments after the last local allocation and adjust SGPR32 to the address
   after the last local allocation.

9. All other registers are unspecified.
10. Any necessary ``s_waitcnt`` has been performed to ensure memory is available
    to the function.

On exit from a function:

1. VGPR0-31 and SGPR4-29 are used to pass function result arguments as
   described below. Any registers used are considered clobbered registers.
2. The following registers are preserved and have the same value as on entry:

   * FLAT_SCRATCH
   * EXEC
   * GFX6-GFX8: M0
   * All SGPR registers except the clobbered registers of SGPR4-31.
   * VGPR40-47
   * VGPR56-63
   * VGPR72-79
   * VGPR88-95
   * VGPR104-111
   * VGPR120-127
   * VGPR136-143
   * VGPR152-159
   * VGPR168-175
   * VGPR184-191
   * VGPR200-207
   * VGPR216-223
   * VGPR232-239
   * VGPR248-255

   .. note::

     Except for the argument registers, the clobbered VGPRs and the preserved
     registers are intermixed at regular intervals in order to keep a
     similar ratio independent of the number of allocated VGPRs.

   * GFX90A: All AGPR registers except the clobbered registers AGPR0-31.
   * Lanes of all VGPRs that are inactive at the call site.

   For the AMDGPU backend, an inter-procedural register allocation (IPRA)
   optimization may mark some of the clobbered SGPR and VGPR registers as
   preserved if it can be determined that the called function does not change
   their value.

3. The PC is set to the RA provided on entry.
4. MODE register: *TBD*.
5. All other registers are clobbered.
6. Any necessary ``s_waitcnt`` has been performed to ensure memory accessed by
   the function is available to the caller.

.. TODO::

  - How are function results returned? The address of structured types is passed
    by reference, but what about other types?

The function input arguments are made up of the formal arguments explicitly
declared by the source language function plus the implicit input arguments used
by the implementation.

The source language input arguments are:

1. Any source language implicit ``this`` or ``self`` argument comes first as a
   pointer type.
2. Followed by the function formal arguments in left to right source order.

The source language result arguments are:

1. The function result argument.

Source language input or result struct type arguments that are less than or
equal to 16 bytes are decomposed recursively into their base type fields, and
each field is passed as if a separate argument. For input arguments, if the
called function requires the struct to be in memory, for example because its
address is taken, then the function body is responsible for allocating a stack
location and copying the field arguments into it. Clang terms this *direct
struct*.

Source language input struct type arguments that are greater than 16 bytes
are passed by reference.
The caller is responsible for allocating a stack
location to make a copy of the struct value and passes the address as the input
argument. The called function is responsible for performing the dereference when
accessing the input argument. Clang terms this *by-value struct*.

A source language result struct type argument that is greater than 16 bytes is
returned by reference. The caller is responsible for allocating a stack location
to hold the result value and passes the address as the last input argument
(before the implicit input arguments). In this case there are no result
arguments. The called function is responsible for performing the dereference
when storing the result value. Clang terms this *structured return (sret)*.

*TODO: correct the ``sret`` definition.*

.. TODO::

  Is this definition correct? Or is ``sret`` only used if passing in registers,
  and pass as non-decomposed struct as stack argument? Or something else? Is the
  memory location in the caller stack frame, or a stack memory argument and so
  no address is passed as the caller can directly write to the argument stack
  location? But then the stack location is still live after return. If an
  argument stack location, is it the first stack argument or the last one?

Lambda argument types are treated as struct types with an implementation defined
set of fields.

.. TODO::

  Need to specify the ABI for lambda types for AMDGPU.

For the AMDGPU backend, all source language arguments (including the decomposed
struct type arguments) are passed in VGPRs unless marked ``inreg``, in which
case they are passed in SGPRs.

The AMDGPU backend walks the function call graph from the leaves to determine
which implicit input arguments are used, propagating to each caller of the
function. The used implicit arguments are appended to the function arguments
after the source language arguments in the following order:

.. TODO::

  Is recursion or external functions supported?

1. Work-Item ID (1 VGPR)

   The X, Y and Z work-item IDs are packed into a single VGPR with the
   following layout. Only fields actually used by the function are set. The
   other bits are undefined.

   The values come from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.

   .. table:: Work-item implicit argument layout
      :name: amdgpu-amdhsa-workitem-implicit-argument-layout-table

      ======= ======= ==============
      Bits    Size    Field Name
      ======= ======= ==============
      9:0     10 bits X Work-Item ID
      19:10   10 bits Y Work-Item ID
      29:20   10 bits Z Work-Item ID
      31:30   2 bits  Unused
      ======= ======= ==============

2. Dispatch Ptr (2 SGPRs)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

3. Queue Ptr (2 SGPRs)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

4. Kernarg Segment Ptr (2 SGPRs)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

5. Dispatch Id (2 SGPRs)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

6. Work-Group ID X (1 SGPR)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

7. Work-Group ID Y (1 SGPR)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

8. Work-Group ID Z (1 SGPR)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

9. Implicit Argument Ptr (2 SGPRs)

   The value is computed by adding an offset to the Kernarg Segment Ptr to
   get the global address space pointer to the first kernarg implicit
   argument.

The input and result arguments are assigned in order in the following manner:

.. note::

   There are likely some errors and omissions in the following description
   that need correction.

   .. TODO::

      Check the Clang source code to decipher how function arguments and
      return results are handled. Also see the AMDGPU specific values used.

* VGPR arguments are assigned to consecutive VGPRs starting at VGPR0 up to
  VGPR31.

  If there are more arguments than will fit in these registers, the remaining
  arguments are allocated on the stack in order on naturally aligned
  addresses.

  .. TODO::

     How are overly aligned structures allocated on the stack?

* SGPR arguments are assigned to consecutive SGPRs starting at SGPR0 up to
  SGPR29.

  If there are more arguments than will fit in these registers, the remaining
  arguments are allocated on the stack in order on naturally aligned
  addresses.

Note that decomposed struct type arguments may have some fields passed in
registers and some in memory.

.. TODO::

   So, a struct which can pass some fields as decomposed register arguments
   will pass the rest as decomposed stack elements? But an argument that will
   not start in registers will not be decomposed and will be passed as a
   non-decomposed stack value?
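The 10-bit packing of the work-item IDs described above can be sketched in a
few lines. This is only an illustration of the bit layout in the work-item
implicit argument layout table, not backend code; the function names are
hypothetical:

```python
def pack_workitem_ids(x, y, z):
    """Pack X/Y/Z work-item IDs into one 32-bit value using the 10-bit
    fields from the layout table: bits 9:0 = X, 19:10 = Y, 29:20 = Z.
    Bits 31:30 are left as zero here (the table marks them unused)."""
    assert 0 <= x < 1024 and 0 <= y < 1024 and 0 <= z < 1024
    return x | (y << 10) | (z << 20)

def unpack_workitem_ids(packed):
    """Recover the three IDs from the packed value, ignoring bits 31:30."""
    mask = (1 << 10) - 1
    return packed & mask, (packed >> 10) & mask, (packed >> 20) & mask
```

Note that a real function only sees defined values in the fields it actually
uses; the other fields are undefined, so a reader must mask as above rather
than assume the unused bits are zero.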
The following is not part of the AMDGPU function calling convention but
describes how the AMDGPU backend implements function calls:

1. SGPR33 is used as a frame pointer (FP) if necessary. Like the SP, it is an
   unswizzled scratch address. It is only needed if runtime-sized ``alloca``
   are used, or for the reasons defined in ``SIFrameLowering``.
2. Runtime stack alignment is supported. SGPR34 is used as a base pointer (BP)
   to access the incoming stack arguments in the function. The BP is needed
   only when the function requires runtime stack alignment.

3. Allocating SGPR arguments on the stack is not supported.

4. No CFI is currently generated. See
   :ref:`amdgpu-dwarf-call-frame-information`.

   .. note::

      CFI will be generated that defines the CFA as the unswizzled address
      relative to the wave scratch base in the unswizzled private address
      space of the lowest address stack allocated local variable.

      ``DW_AT_frame_base`` will be defined as the swizzled address in the
      swizzled private address space by dividing the CFA by the wavefront
      size (since the CFA is always at least dword aligned, which matches the
      scratch swizzle element size).

      If no dynamic stack alignment was performed, the stack allocated
      arguments are accessed as negative offsets relative to
      ``DW_AT_frame_base``, and the local variables and register spill slots
      are accessed as positive offsets relative to ``DW_AT_frame_base``.

5. Function argument passing is implemented by copying the input physical
   registers to virtual registers on entry. The register allocator can spill
   if necessary. These are copied back to physical registers at call sites.
   The net effect is that each function call can have these values in
   entirely distinct locations. The IPRA can help avoid shuffling argument
   registers.
6. Call sites are implemented by setting up the arguments at positive offsets
   from SP. Then SP is incremented to account for the known frame size before
   the call and decremented after the call.

   .. note::

      The CFI will reflect the changed calculation needed to compute the CFA
      from SP.

7. 4-byte spill slots are used in the stack frame. One slot is allocated for
   an emergency spill slot. Buffer instructions are used for stack accesses
   and not the ``flat_scratch`` instruction.

   .. TODO::

      Explain when the emergency spill slot is used.

.. TODO::

   Possible broken issues:

   - Stack arguments must be aligned to required alignment.
   - Stack is aligned to max(16, max formal argument alignment).
   - Direct argument < 64 bits should check register budget.
   - Register budget calculation should respect ``inreg`` for SGPR.
   - SGPR overflow is not handled.
   - struct with 1 member unpeeling is not checking size of member.
   - ``sret`` is after ``this`` pointer.
   - Caller is not implementing stack realignment: need an extra pointer.
   - Should say AMDGPU passes FP rather than SP.
   - Should CFI define CFA as address of locals or arguments? The difference
     is apparent once dynamic alignment is implemented.
   - If the ``SCRATCH`` instruction could allow negative offsets, then FP
     could be made the highest address of the stack frame, with negative
     offsets used for locals. This would allow SP to be the same as FP and
     could support signal-handler-like usage, as there would be a real SP for
     the top of the stack.
   - How is ``sret`` passed on the stack? In argument stack area? Can it
     overlay arguments?

AMDPAL
------

This section provides code conventions used when the target triple OS is
``amdpal`` (see :ref:`amdgpu-target-triples`).

..
_amdgpu-amdpal-code-object-metadata-section: 13818 13819Code Object Metadata 13820~~~~~~~~~~~~~~~~~~~~ 13821 13822.. note:: 13823 13824 The metadata is currently in development and is subject to major 13825 changes. Only the current version is supported. *When this document 13826 was generated the version was 2.6.* 13827 13828Code object metadata is specified by the ``NT_AMDGPU_METADATA`` note 13829record (see :ref:`amdgpu-note-records-v3-onwards`). 13830 13831The metadata is represented as Message Pack formatted binary data (see 13832[MsgPack]_). The top level is a Message Pack map that includes the keys 13833defined in table :ref:`amdgpu-amdpal-code-object-metadata-map-table` 13834and referenced tables. 13835 13836Additional information can be added to the maps. To avoid conflicts, any 13837key names should be prefixed by "*vendor-name*." where ``vendor-name`` 13838can be the name of the vendor and specific vendor tool that generates the 13839information. The prefix is abbreviated to simply "." when it appears 13840within a map that has been added by the same *vendor-name*. 13841 13842 .. table:: AMDPAL Code Object Metadata Map 13843 :name: amdgpu-amdpal-code-object-metadata-map-table 13844 13845 =================== ============== ========= ====================================================================== 13846 String Key Value Type Required? Description 13847 =================== ============== ========= ====================================================================== 13848 "amdpal.version" sequence of Required PAL code object metadata (major, minor) version. The current values 13849 2 integers are defined by *Util::Abi::PipelineMetadata(Major|Minor)Version*. 13850 "amdpal.pipelines" sequence of Required Per-pipeline metadata. See 13851 map :ref:`amdgpu-amdpal-code-object-pipeline-metadata-map-table` for the 13852 definition of the keys included in that map. 
13853 =================== ============== ========= ====================================================================== 13854 13855.. 13856 13857 .. table:: AMDPAL Code Object Pipeline Metadata Map 13858 :name: amdgpu-amdpal-code-object-pipeline-metadata-map-table 13859 13860 ====================================== ============== ========= =================================================== 13861 String Key Value Type Required? Description 13862 ====================================== ============== ========= =================================================== 13863 ".name" string Source name of the pipeline. 13864 ".type" string Pipeline type, e.g. VsPs. Values include: 13865 13866 - "VsPs" 13867 - "Gs" 13868 - "Cs" 13869 - "Ngg" 13870 - "Tess" 13871 - "GsTess" 13872 - "NggTess" 13873 13874 ".internal_pipeline_hash" sequence of Required Internal compiler hash for this pipeline. Lower 13875 2 integers 64 bits is the "stable" portion of the hash, used 13876 for e.g. shader replacement lookup. Upper 64 bits 13877 is the "unique" portion of the hash, used for 13878 e.g. pipeline cache lookup. The value is 13879 implementation defined, and can not be relied on 13880 between different builds of the compiler. 13881 ".shaders" map Per-API shader metadata. See 13882 :ref:`amdgpu-amdpal-code-object-shader-map-table` 13883 for the definition of the keys included in that 13884 map. 13885 ".hardware_stages" map Per-hardware stage metadata. See 13886 :ref:`amdgpu-amdpal-code-object-hardware-stage-map-table` 13887 for the definition of the keys included in that 13888 map. 13889 ".shader_functions" map Per-shader function metadata. See 13890 :ref:`amdgpu-amdpal-code-object-shader-function-map-table` 13891 for the definition of the keys included in that 13892 map. 13893 ".registers" map Required Hardware register configuration. See 13894 :ref:`amdgpu-amdpal-code-object-register-map-table` 13895 for the definition of the keys included in that 13896 map. 
13897 ".user_data_limit" integer Number of user data entries accessed by this 13898 pipeline. 13899 ".spill_threshold" integer The user data spill threshold. 0xFFFF for 13900 NoUserDataSpilling. 13901 ".uses_viewport_array_index" boolean Indicates whether or not the pipeline uses the 13902 viewport array index feature. Pipelines which use 13903 this feature can render into all 16 viewports, 13904 whereas pipelines which do not use it are 13905 restricted to viewport #0. 13906 ".es_gs_lds_size" integer Size in bytes of LDS space used internally for 13907 handling data-passing between the ES and GS 13908 shader stages. This can be zero if the data is 13909 passed using off-chip buffers. This value should 13910 be used to program all user-SGPRs which have been 13911 marked with "UserDataMapping::EsGsLdsSize" 13912 (typically only the GS and VS HW stages will ever 13913 have a user-SGPR so marked). 13914 ".nggSubgroupSize" integer Explicit maximum subgroup size for NGG shaders 13915 (maximum number of threads in a subgroup). 13916 ".num_interpolants" integer Graphics only. Number of PS interpolants. 13917 ".mesh_scratch_memory_size" integer Max mesh shader scratch memory used. 13918 ".api" string Name of the client graphics API. 13919 ".api_create_info" binary Graphics API shader create info binary blob. Can 13920 be defined by the driver using the compiler if 13921 they want to be able to correlate API-specific 13922 information used during creation at a later time. 13923 ====================================== ============== ========= =================================================== 13924 13925.. 13926 13927 .. 
table:: AMDPAL Code Object Shader Map 13928 :name: amdgpu-amdpal-code-object-shader-map-table 13929 13930 13931 +-------------+--------------+-------------------------------------------------------------------+ 13932 |String Key |Value Type |Description | 13933 +=============+==============+===================================================================+ 13934 |- ".compute" |map |See :ref:`amdgpu-amdpal-code-object-api-shader-metadata-map-table` | 13935 |- ".vertex" | |for the definition of the keys included in that map. | 13936 |- ".hull" | | | 13937 |- ".domain" | | | 13938 |- ".geometry"| | | 13939 |- ".pixel" | | | 13940 +-------------+--------------+-------------------------------------------------------------------+ 13941 13942.. 13943 13944 .. table:: AMDPAL Code Object API Shader Metadata Map 13945 :name: amdgpu-amdpal-code-object-api-shader-metadata-map-table 13946 13947 ==================== ============== ========= ===================================================================== 13948 String Key Value Type Required? Description 13949 ==================== ============== ========= ===================================================================== 13950 ".api_shader_hash" sequence of Required Input shader hash, typically passed in from the client. The value 13951 2 integers is implementation defined, and can not be relied on between 13952 different builds of the compiler. 13953 ".hardware_mapping" sequence of Required Flags indicating the HW stages this API shader maps to. Values 13954 string include: 13955 13956 - ".ls" 13957 - ".hs" 13958 - ".es" 13959 - ".gs" 13960 - ".vs" 13961 - ".ps" 13962 - ".cs" 13963 13964 ==================== ============== ========= ===================================================================== 13965 13966.. 13967 13968 .. 
table:: AMDPAL Code Object Hardware Stage Map 13969 :name: amdgpu-amdpal-code-object-hardware-stage-map-table 13970 13971 +-------------+--------------+-----------------------------------------------------------------------+ 13972 |String Key |Value Type |Description | 13973 +=============+==============+=======================================================================+ 13974 |- ".ls" |map |See :ref:`amdgpu-amdpal-code-object-hardware-stage-metadata-map-table` | 13975 |- ".hs" | |for the definition of the keys included in that map. | 13976 |- ".es" | | | 13977 |- ".gs" | | | 13978 |- ".vs" | | | 13979 |- ".ps" | | | 13980 |- ".cs" | | | 13981 +-------------+--------------+-----------------------------------------------------------------------+ 13982 13983.. 13984 13985 .. table:: AMDPAL Code Object Hardware Stage Metadata Map 13986 :name: amdgpu-amdpal-code-object-hardware-stage-metadata-map-table 13987 13988 ========================== ============== ========= =============================================================== 13989 String Key Value Type Required? Description 13990 ========================== ============== ========= =============================================================== 13991 ".entry_point" string The ELF symbol pointing to this pipeline's stage entry point. 13992 ".scratch_memory_size" integer Scratch memory size in bytes. 13993 ".lds_size" integer Local Data Share size in bytes. 13994 ".perf_data_buffer_size" integer Performance data buffer size in bytes. 13995 ".vgpr_count" integer Number of VGPRs used. 13996 ".agpr_count" integer Number of AGPRs used. 13997 ".sgpr_count" integer Number of SGPRs used. 13998 ".vgpr_limit" integer If non-zero, indicates the shader was compiled with a 13999 directive to instruct the compiler to limit the VGPR usage to 14000 be less than or equal to the specified value (only set if 14001 different from HW default). 
14002 ".sgpr_limit" integer SGPR count upper limit (only set if different from HW 14003 default). 14004 ".threadgroup_dimensions" sequence of Thread-group X/Y/Z dimensions (Compute only). 14005 3 integers 14006 ".wavefront_size" integer Wavefront size (only set if different from HW default). 14007 ".uses_uavs" boolean The shader reads or writes UAVs. 14008 ".uses_rovs" boolean The shader reads or writes ROVs. 14009 ".writes_uavs" boolean The shader writes to one or more UAVs. 14010 ".writes_depth" boolean The shader writes out a depth value. 14011 ".uses_append_consume" boolean The shader uses append and/or consume operations, either 14012 memory or GDS. 14013 ".uses_prim_id" boolean The shader uses PrimID. 14014 ========================== ============== ========= =============================================================== 14015 14016.. 14017 14018 .. table:: AMDPAL Code Object Shader Function Map 14019 :name: amdgpu-amdpal-code-object-shader-function-map-table 14020 14021 =============== ============== ==================================================================== 14022 String Key Value Type Description 14023 =============== ============== ==================================================================== 14024 *symbol name* map *symbol name* is the ELF symbol name of the shader function code 14025 entry address. The value is the function's metadata. See 14026 :ref:`amdgpu-amdpal-code-object-shader-function-metadata-map-table`. 14027 =============== ============== ==================================================================== 14028 14029.. 14030 14031 .. 
table:: AMDPAL Code Object Shader Function Metadata Map 14032 :name: amdgpu-amdpal-code-object-shader-function-metadata-map-table 14033 14034 ============================= ============== ================================================================= 14035 String Key Value Type Description 14036 ============================= ============== ================================================================= 14037 ".api_shader_hash" sequence of Input shader hash, typically passed in from the client. The value 14038 2 integers is implementation defined, and can not be relied on between 14039 different builds of the compiler. 14040 ".scratch_memory_size" integer Size in bytes of scratch memory used by the shader. 14041 ".lds_size" integer Size in bytes of LDS memory. 14042 ".vgpr_count" integer Number of VGPRs used by the shader. 14043 ".sgpr_count" integer Number of SGPRs used by the shader. 14044 ".stack_frame_size_in_bytes" integer Amount of stack size used by the shader. 14045 ".shader_subtype" string Shader subtype/kind. Values include: 14046 14047 - "Unknown" 14048 14049 ============================= ============== ================================================================= 14050 14051.. 14052 14053 .. table:: AMDPAL Code Object Register Map 14054 :name: amdgpu-amdpal-code-object-register-map-table 14055 14056 ========================== ============== ==================================================================== 14057 32-bit Integer Key Value Type Description 14058 ========================== ============== ==================================================================== 14059 ``reg offset`` 32-bit integer ``reg offset`` is the dword offset into the GFXIP register space of 14060 a GRBM register (i.e., driver accessible GPU register number, not 14061 shader GPR register number). The driver is required to program each 14062 specified register to the corresponding specified value when 14063 executing this pipeline. 
Typically, the ``reg offsets`` are the 14064 ``uint16_t`` offsets to each register as defined by the hardware 14065 chip headers. The register is set to the provided value. However, a 14066 ``reg offset`` that specifies a user data register (e.g., 14067 COMPUTE_USER_DATA_0) needs special treatment. See 14068 :ref:`amdgpu-amdpal-code-object-user-data-section` section for more 14069 information. 14070 ========================== ============== ==================================================================== 14071 14072.. _amdgpu-amdpal-code-object-user-data-section: 14073 14074User Data 14075+++++++++ 14076 14077Each hardware stage has a set of 32-bit physical SPI *user data registers* 14078(either 16 or 32 based on graphics IP and the stage) which can be 14079written from a command buffer and then loaded into SGPRs when waves are 14080launched via a subsequent dispatch or draw operation. This is the way 14081most arguments are passed from the application/runtime to a hardware 14082shader. 14083 14084PAL abstracts this functionality by exposing a set of 128 *user data 14085entries* per pipeline a client can use to pass arguments from a command 14086buffer to one or more shaders in that pipeline. The ELF code object must 14087specify a mapping from virtualized *user data entries* to physical *user 14088data registers*, and PAL is responsible for implementing that mapping, 14089including spilling overflow *user data entries* to memory if needed. 14090 14091Since the *user data registers* are GRBM-accessible SPI registers, this 14092mapping is actually embedded in the ``.registers`` metadata entry. For 14093most registers, the value in that map is a literal 32-bit value that 14094should be written to the register by the driver. 
However, when the 14095register is a *user data register* (any USER_DATA register e.g., 14096SPI_SHADER_USER_DATA_PS_5), the value is instead an encoding that tells 14097the driver to write either a *user data entry* value or one of several 14098driver-internal values to the register. This encoding is described in 14099the following table: 14100 14101.. note:: 14102 14103 Currently, *user data registers* 0 and 1 (e.g., SPI_SHADER_USER_DATA_PS_0, 14104 and SPI_SHADER_USER_DATA_PS_1) are reserved. *User data register* 0 must 14105 always be programmed to the address of the GlobalTable, and *user data 14106 register* 1 must always be programmed to the address of the PerShaderTable. 14107 14108.. 14109 14110 .. table:: AMDPAL User Data Mapping 14111 :name: amdgpu-amdpal-code-object-metadata-user-data-mapping-table 14112 14113 ========== ================= =============================================================================== 14114 Value Name Description 14115 ========== ================= =============================================================================== 14116 0..127 *User Data Entry* 32-bit value of user_data_entry[N] as specified via *CmdSetUserData()* 14117 0x10000000 GlobalTable 32-bit pointer to GPU memory containing the global internal table (should 14118 always point to *user data register* 0). 14119 0x10000001 PerShaderTable 32-bit pointer to GPU memory containing the per-shader internal table. See 14120 :ref:`amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section` 14121 for more detail (should always point to *user data register* 1). 14122 0x10000002 SpillTable 32-bit pointer to GPU memory containing the user data spill table. See 14123 :ref:`amdgpu-amdpal-code-object-metadata-user-data-spill-table-section` for 14124 more detail. 14125 0x10000003 BaseVertex Vertex offset (32-bit unsigned integer). Not needed if the pipeline doesn't 14126 reference the draw index in the vertex shader. 
Only supported by the first 14127 stage in a graphics pipeline. 14128 0x10000004 BaseInstance Instance offset (32-bit unsigned integer). Only supported by the first stage in 14129 a graphics pipeline. 14130 0x10000005 DrawIndex Draw index (32-bit unsigned integer). Only supported by the first stage in a 14131 graphics pipeline. 14132 0x10000006 Workgroup Thread group count (32-bit unsigned integer). Low half of a 64-bit address of 14133 a buffer containing the grid dimensions for a Compute dispatch operation. The 14134 high half of the address is stored in the next sequential user-SGPR. Only 14135 supported by compute pipelines. 14136 0x1000000A EsGsLdsSize Indicates that PAL will program this user-SGPR to contain the amount of LDS 14137 space used for the ES/GS pseudo-ring-buffer for passing data between shader 14138 stages. 14139 0x1000000B ViewId View id (32-bit unsigned integer) identifies a view of graphic 14140 pipeline instancing. 14141 0x1000000C StreamOutTable 32-bit pointer to GPU memory containing the stream out target SRD table. This 14142 can only appear for one shader stage per pipeline. 14143 0x1000000D PerShaderPerfData 32-bit pointer to GPU memory containing the per-shader performance data buffer. 14144 0x1000000F VertexBufferTable 32-bit pointer to GPU memory containing the vertex buffer SRD table. This can 14145 only appear for one shader stage per pipeline. 14146 0x10000010 UavExportTable 32-bit pointer to GPU memory containing the UAV export SRD table. This can 14147 only appear for one shader stage per pipeline (PS). These replace color targets 14148 and are completely separate from any UAVs used by the shader. This is optional, 14149 and only used by the PS when UAV exports are used to replace color-target 14150 exports to optimize specific shaders. 14151 0x10000011 NggCullingData 64-bit pointer to GPU memory containing the hardware register data needed by 14152 some NGG pipelines to perform culling. 
This value contains the address of the 14153 first of two consecutive registers which provide the full GPU address. 14154 0x10000015 FetchShaderPtr 64-bit pointer to GPU memory containing the fetch shader subroutine. 14155 ========== ================= =============================================================================== 14156 14157.. _amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section: 14158 14159Per-Shader Table 14160################ 14161 14162Low 32 bits of the GPU address for an optional buffer in the ``.data`` 14163section of the ELF. The high 32 bits of the address match the high 32 bits 14164of the shader's program counter. 14165 14166The buffer can be anything the shader compiler needs it for, and 14167allows each shader to have its own region of the ``.data`` section. 14168Typically, this could be a table of buffer SRD's and the data pointed to 14169by the buffer SRD's, but it could be a flat-address region of memory as 14170well. Its layout and usage are defined by the shader compiler. 14171 14172Each shader's table in the ``.data`` section is referenced by the symbol 14173``_amdgpu_``\ *xs*\ ``_shdr_intrl_data`` where *xs* corresponds with the 14174hardware shader stage the data is for. E.g., 14175``_amdgpu_cs_shdr_intrl_data`` for the compute shader hardware stage. 14176 14177.. _amdgpu-amdpal-code-object-metadata-user-data-spill-table-section: 14178 14179Spill Table 14180########### 14181 14182It is possible for a hardware shader to need access to more *user data 14183entries* than there are slots available in user data registers for one 14184or more hardware shader stages. In that case, the PAL runtime expects 14185the necessary *user data entries* to be spilled to GPU memory and use 14186one user data register to point to the spilled user data memory. The 14187value of the *user data entry* must then represent the location where 14188a shader expects to read the low 32-bits of the table's GPU virtual 14189address. 
The *spill table* itself represents a set of 32-bit values
managed by the PAL runtime in GPU-accessible memory that can be made
indirectly accessible to a hardware shader.

Unspecified OS
--------------

This section provides code conventions used when the target triple OS is
empty (see :ref:`amdgpu-target-triples`).

Trap Handler ABI
~~~~~~~~~~~~~~~~

For code objects generated by the AMDGPU backend for a non-amdhsa OS, the
runtime does not install a trap handler. The ``llvm.trap`` and
``llvm.debugtrap`` instructions are handled as follows:

  .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
     :name: amdgpu-trap-handler-for-non-amdhsa-os-table

     =============== =============== ===========================================
     Usage           Code Sequence   Description
     =============== =============== ===========================================
     llvm.trap       s_endpgm        Causes wavefront to be terminated.
     llvm.debugtrap  *none*          Compiler warning given that there is no
                                     trap handler installed.
     =============== =============== ===========================================

Source Languages
================

.. _amdgpu-opencl:

OpenCL
------

When the language is OpenCL the following differences occur:

1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
2. The AMDGPU backend appends additional arguments to the kernel's explicit
   arguments for the AMDHSA OS (see
   :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
3. Additional metadata is generated
   (see :ref:`amdgpu-amdhsa-code-object-metadata`).

  .. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
     :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table

     ======== ==== ========= ===========================================
     Position Byte Byte      Description
              Size Alignment
     ======== ==== ========= ===========================================
     1        8    8         OpenCL Global Offset X
     2        8    8         OpenCL Global Offset Y
     3        8    8         OpenCL Global Offset Z
     4        8    8         OpenCL address of printf buffer
     5        8    8         OpenCL address of virtual queue used by
                             enqueue_kernel.
     6        8    8         OpenCL address of AqlWrap struct used by
                             enqueue_kernel.
     7        8    8         Pointer argument used for Multi-grid
                             synchronization.
     ======== ==== ========= ===========================================

.. _amdgpu-hcc:

HCC
---

When the language is HCC the following differences occur:

1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).

.. _amdgpu-assembler:

Assembler
---------

The AMDGPU backend has an LLVM-MC based assembler, which is currently in
development. It supports AMDGCN GFX6-GFX11.

This section describes the general syntax for instructions and operands.

Instructions
~~~~~~~~~~~~

An instruction has the following :doc:`syntax<AMDGPUInstructionSyntax>`:

  | ``<``\ *opcode*\ ``> <``\ *operand0*\ ``>, <``\ *operand1*\ ``>,...
    <``\ *modifier0*\ ``> <``\ *modifier1*\ ``>...``

:doc:`Operands<AMDGPUOperandSyntax>` are comma-separated, while
:doc:`modifiers<AMDGPUModifierSyntax>` are space-separated.

The order of operands and modifiers is fixed.
Most modifiers are optional and may be omitted.

Links to the detailed instruction syntax descriptions may be found in the
following table. Note that features under development are not included in
these descriptions.
14289 14290 ============= ============================================= ======================================= 14291 Architecture Core ISA ISA Variants and Extensions 14292 ============= ============================================= ======================================= 14293 GCN 2 :doc:`GFX7<AMDGPU/AMDGPUAsmGFX7>` \- 14294 GCN 3, GCN 4 :doc:`GFX8<AMDGPU/AMDGPUAsmGFX8>` \- 14295 GCN 5 :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>` :doc:`gfx900<AMDGPU/AMDGPUAsmGFX900>` 14296 14297 :doc:`gfx902<AMDGPU/AMDGPUAsmGFX900>` 14298 14299 :doc:`gfx904<AMDGPU/AMDGPUAsmGFX904>` 14300 14301 :doc:`gfx906<AMDGPU/AMDGPUAsmGFX906>` 14302 14303 :doc:`gfx909<AMDGPU/AMDGPUAsmGFX900>` 14304 14305 :doc:`gfx90c<AMDGPU/AMDGPUAsmGFX900>` 14306 14307 CDNA 1 :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>` :doc:`gfx908<AMDGPU/AMDGPUAsmGFX908>` 14308 14309 CDNA 2 :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>` :doc:`gfx90a<AMDGPU/AMDGPUAsmGFX90a>` 14310 14311 CDNA 3 :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>` :doc:`gfx940<AMDGPU/AMDGPUAsmGFX940>` 14312 14313 RDNA 1 :doc:`GFX10 RDNA1<AMDGPU/AMDGPUAsmGFX10>` :doc:`gfx1010<AMDGPU/AMDGPUAsmGFX10>` 14314 14315 :doc:`gfx1011<AMDGPU/AMDGPUAsmGFX1011>` 14316 14317 :doc:`gfx1012<AMDGPU/AMDGPUAsmGFX1011>` 14318 14319 :doc:`gfx1013<AMDGPU/AMDGPUAsmGFX1013>` 14320 14321 RDNA 2 :doc:`GFX10 RDNA2<AMDGPU/AMDGPUAsmGFX1030>` :doc:`gfx1030<AMDGPU/AMDGPUAsmGFX1030>` 14322 14323 :doc:`gfx1031<AMDGPU/AMDGPUAsmGFX1030>` 14324 14325 :doc:`gfx1032<AMDGPU/AMDGPUAsmGFX1030>` 14326 14327 :doc:`gfx1033<AMDGPU/AMDGPUAsmGFX1030>` 14328 14329 :doc:`gfx1034<AMDGPU/AMDGPUAsmGFX1030>` 14330 14331 :doc:`gfx1035<AMDGPU/AMDGPUAsmGFX1030>` 14332 14333 :doc:`gfx1036<AMDGPU/AMDGPUAsmGFX1030>` 14334 ============= ============================================= ======================================= 14335 14336For more information about instructions, their semantics and supported 14337combinations of operands, refer to one of instruction set architecture manuals 14338[AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_, 
[AMD-GCN-GFX900-GFX904-VEGA]_, [AMD-GCN-GFX906-VEGA7NM]_,
[AMD-GCN-GFX908-CDNA1]_, [AMD-GCN-GFX90A-CDNA2]_, [AMD-GCN-GFX10-RDNA1]_ and
[AMD-GCN-GFX10-RDNA2]_.

Operands
~~~~~~~~

A detailed description of operands may be found
:doc:`here<AMDGPUOperandSyntax>`.

Modifiers
~~~~~~~~~

A detailed description of modifiers may be found
:doc:`here<AMDGPUModifierSyntax>`.

Instruction Examples
~~~~~~~~~~~~~~~~~~~~

DS
++

.. code-block:: nasm

  ds_add_u32 v2, v4 offset:16
  ds_write_src2_b64 v2 offset0:4 offset1:8
  ds_cmpst_f32 v2, v4, v6
  ds_min_rtn_f64 v[8:9], v2, v[4:5]

For a full list of supported instructions, refer to "LDS/GDS instructions" in
the ISA Manual.

FLAT
++++

.. code-block:: nasm

  flat_load_dword v1, v[3:4]
  flat_store_dwordx3 v[3:4], v[5:7]
  flat_atomic_swap v1, v[3:4], v5 glc
  flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
  flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc

For a full list of supported instructions, refer to "FLAT instructions" in
the ISA Manual.

MUBUF
+++++

.. code-block:: nasm

  buffer_load_dword v1, off, s[4:7], s1
  buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
  buffer_store_format_xy v[1:2], off, s[4:7], s1
  buffer_wbinvl1
  buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc

For a full list of supported instructions, refer to "MUBUF Instructions" in
the ISA Manual.

SMRD/SMEM
+++++++++

.. code-block:: nasm

  s_load_dword s1, s[2:3], 0xfc
  s_load_dwordx8 s[8:15], s[2:3], s4
  s_load_dwordx16 s[88:103], s[2:3], s4
  s_dcache_inv_vol
  s_memtime s[4:5]

For a full list of supported instructions, refer to "Scalar Memory Operations"
in the ISA Manual.

SOP1
++++

.. code-block:: nasm

  s_mov_b32 s1, s2
  s_mov_b64 s[0:1], 0x80000000
  s_cmov_b32 s1, 200
  s_wqm_b64 s[2:3], s[4:5]
  s_bcnt0_i32_b64 s1, s[2:3]
  s_swappc_b64 s[2:3], s[4:5]
  s_cbranch_join s[4:5]

For a full list of supported instructions, refer to "SOP1 Instructions" in
the ISA Manual.

SOP2
++++

.. code-block:: nasm

  s_add_u32 s1, s2, s3
  s_and_b64 s[2:3], s[4:5], s[6:7]
  s_cselect_b32 s1, s2, s3
  s_andn2_b32 s2, s4, s6
  s_lshr_b64 s[2:3], s[4:5], s6
  s_ashr_i32 s2, s4, s6
  s_bfm_b64 s[2:3], s4, s6
  s_bfe_i64 s[2:3], s[4:5], s6
  s_cbranch_g_fork s[4:5], s[6:7]

For a full list of supported instructions, refer to "SOP2 Instructions" in
the ISA Manual.

SOPC
++++

.. code-block:: nasm

  s_cmp_eq_i32 s1, s2
  s_bitcmp1_b32 s1, s2
  s_bitcmp0_b64 s[2:3], s4
  s_setvskip s3, s5

For a full list of supported instructions, refer to "SOPC Instructions" in
the ISA Manual.

SOPP
++++

.. code-block:: nasm

  s_barrier
  s_nop 2
  s_endpgm
  s_waitcnt 0 ; Wait for all counters to be 0
  s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
  s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
  s_sethalt 9
  s_sleep 10
  s_sendmsg 0x1
  s_sendmsg sendmsg(MSG_INTERRUPT)
  s_trap 1

For a full list of supported instructions, refer to "SOPP Instructions" in
the ISA Manual.

Unless otherwise mentioned, little verification is performed on the operands
of SOPP instructions, so it is up to the programmer to be familiar with the
range of acceptable values.
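
The ``s_waitcnt`` examples above pack the individual counters into a single
16-bit immediate. The packing can be sketched as below, assuming the GFX6-GFX8
field layout (vmcnt in bits [3:0], expcnt in bits [6:4], lgkmcnt in bits
[11:8]; later targets widen and move these fields):

```python
def s_waitcnt_imm(vmcnt=0xF, expcnt=0x7, lgkmcnt=0xF):
    """Pack s_waitcnt counters into the 16-bit immediate (GFX6-GFX8 sketch).

    A counter left at its maximum value is effectively not waited on,
    which is why unspecified counters default to all-ones.
    """
    assert vmcnt <= 0xF and expcnt <= 0x7 and lgkmcnt <= 0xF
    return vmcnt | (expcnt << 4) | (lgkmcnt << 8)

# "s_waitcnt 0" waits for all three counters to reach zero:
print(hex(s_waitcnt_imm(vmcnt=0, expcnt=0, lgkmcnt=0)))  # 0x0
# "s_waitcnt vmcnt(1)" leaves the other counters unconstrained:
print(hex(s_waitcnt_imm(vmcnt=1)))  # 0xf71
```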

VALU
++++

For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
the assembler will automatically use the optimal encoding based on the
operands. To force a specific encoding, add a suffix to the opcode of the
instruction:

* _e32 for 32-bit VOP1/VOP2/VOPC
* _e64 for 64-bit VOP3
* _dpp for VOP_DPP
* _sdwa for VOP_SDWA

VOP1/VOP2/VOP3/VOPC examples:

.. code-block:: nasm

  v_mov_b32 v1, v2
  v_mov_b32_e32 v1, v2
  v_nop
  v_cvt_f64_i32_e32 v[1:2], v2
  v_floor_f32_e32 v1, v2
  v_bfrev_b32_e32 v1, v2
  v_add_f32_e32 v1, v2, v3
  v_mul_i32_i24_e64 v1, v2, 3
  v_mul_i32_i24_e32 v1, -3, v3
  v_mul_i32_i24_e32 v1, -100, v3
  v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
  v_max_f16_e32 v1, v2, v3

VOP_DPP examples:

.. code-block:: nasm

  v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
  v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_mov_b32 v0, v0 wave_shl:1
  v_mov_b32 v0, v0 row_mirror
  v_mov_b32 v0, v0 row_bcast:31
  v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0

VOP_SDWA examples:

.. code-block:: nasm

  v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
  v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
  v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
  v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
  v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0

For a full list of supported instructions, refer to "Vector ALU instructions"
in the ISA Manual.

..
_amdgpu-amdhsa-assembler-predefined-symbols-v2: 14538 14539Code Object V2 Predefined Symbols 14540~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 14541 14542.. warning:: 14543 Code object V2 is not the default code object version emitted by 14544 this version of LLVM. 14545 14546The AMDGPU assembler defines and updates some symbols automatically. These 14547symbols do not affect code generation. 14548 14549.option.machine_version_major 14550+++++++++++++++++++++++++++++ 14551 14552Set to the GFX major generation number of the target being assembled for. For 14553example, when assembling for a "GFX9" target this will be set to the integer 14554value "9". The possible GFX major generation numbers are presented in 14555:ref:`amdgpu-processors`. 14556 14557.option.machine_version_minor 14558+++++++++++++++++++++++++++++ 14559 14560Set to the GFX minor generation number of the target being assembled for. For 14561example, when assembling for a "GFX810" target this will be set to the integer 14562value "1". The possible GFX minor generation numbers are presented in 14563:ref:`amdgpu-processors`. 14564 14565.option.machine_version_stepping 14566++++++++++++++++++++++++++++++++ 14567 14568Set to the GFX stepping generation number of the target being assembled for. 14569For example, when assembling for a "GFX704" target this will be set to the 14570integer value "4". The possible GFX stepping generation numbers are presented 14571in :ref:`amdgpu-processors`. 14572 14573.kernel.vgpr_count 14574++++++++++++++++++ 14575 14576Set to zero each time a 14577:ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is 14578encountered. At each instruction, if the current value of this symbol is less 14579than or equal to the maximum VGPR number explicitly referenced within that 14580instruction then the symbol value is updated to equal that VGPR number plus 14581one. 
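
The update rule above can be sketched in Python (a hypothetical model for
illustration, not assembler code):

```python
def update_gpr_count(current, referenced_regs):
    """Apply the .kernel.vgpr_count update rule after one instruction.

    If the running count is less than or equal to the highest register
    number the instruction explicitly references, raise it to that
    register number plus one; otherwise leave it unchanged.
    """
    highest = max(referenced_regs)
    if current <= highest:
        return highest + 1
    return current

count = 0                                   # reset by .amdgpu_hsa_kernel
count = update_gpr_count(count, [0])        # e.g. v_mov_b32 v0, ...
count = update_gpr_count(count, [1, 2, 0])  # e.g. flat_store_dword v[1:2], v0
print(count)  # 3
```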

.kernel.sgpr_count
++++++++++++++++++

Set to zero each time a
:ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
encountered. At each instruction, if the current value of this symbol is less
than or equal to the maximum SGPR number explicitly referenced within that
instruction then the symbol value is updated to equal that SGPR number plus
one.

.. _amdgpu-amdhsa-assembler-directives-v2:

Code Object V2 Directives
~~~~~~~~~~~~~~~~~~~~~~~~~

.. warning::
  Code object V2 is not the default code object version emitted by
  this version of LLVM.

The AMDGPU ABI defines auxiliary data in the output code object. In assembly
source, these data can be specified with assembler directives.

.hsa_code_object_version major, minor
+++++++++++++++++++++++++++++++++++++

*major* and *minor* are integers that specify the version of the HSA code
object that will be generated by the assembler.

.hsa_code_object_isa [major, minor, stepping, vendor, arch]
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

*major*, *minor*, and *stepping* are all integers that describe the instruction
set architecture (ISA) version of the assembly program.

*vendor* and *arch* are quoted strings. *vendor* should always be equal to
"AMD" and *arch* should always be equal to "AMDGPU".

By default, the assembler will derive the ISA version, *vendor*, and *arch*
from the value of the -mcpu option that is passed to the assembler.

.. _amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel:

.amdgpu_hsa_kernel (name)
+++++++++++++++++++++++++

This directive specifies that the symbol with the given name is a kernel entry
point (label) and that the object should contain a corresponding symbol of type
STT_AMDGPU_HSA_KERNEL.
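
As noted for *.hsa_code_object_isa* above, the assembler derives the ISA
version from the -mcpu option. The mapping from a processor name to its
(major, minor, stepping) triple can be sketched as follows (a hypothetical
helper for illustration, not the assembler's actual implementation):

```python
def parse_gfx_version(mcpu):
    """Split a gfx processor name into (major, minor, stepping).

    The last two characters are the minor version and stepping; the
    stepping may be a hexadecimal digit (e.g. gfx90a has stepping 10).
    """
    suffix = mcpu[len("gfx"):]
    return int(suffix[:-2]), int(suffix[-2], 16), int(suffix[-1], 16)

print(parse_gfx_version("gfx803"))   # (8, 0, 3)
print(parse_gfx_version("gfx90a"))   # (9, 0, 10)
print(parse_gfx_version("gfx1030"))  # (10, 3, 0)
```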

.amd_kernel_code_t
++++++++++++++++++

This directive marks the beginning of a list of key / value pairs that are used
to specify the amd_kernel_code_t object that will be emitted by the assembler.
The list must be terminated by the *.end_amd_kernel_code_t* directive. For any
amd_kernel_code_t values that are unspecified a default value will be used. The
default value for all keys is 0, with the following exceptions:

- *amd_code_version_major* defaults to 1.
- *amd_kernel_code_version_minor* defaults to 2.
- *amd_machine_kind* defaults to 1.
- *amd_machine_version_major*, *amd_machine_version_minor*, and
  *amd_machine_version_stepping* are derived from the value of the -mcpu option
  that is passed to the assembler.
- *kernel_code_entry_byte_offset* defaults to 256.
- *wavefront_size* defaults to 6 for all targets before GFX10. For GFX10
  onwards it defaults to 6 if target feature ``wavefrontsize64`` is enabled,
  otherwise 5. Note that the wavefront size is specified as a power of two, so
  a value of **n** means a size of 2^ **n**.
- *call_convention* defaults to -1.
- *kernarg_segment_alignment*, *group_segment_alignment*, and
  *private_segment_alignment* default to 4. Note that alignments are specified
  as a power of 2, so a value of **n** means an alignment of 2^ **n**.
- *enable_tg_split* defaults to 1 if target feature ``tgsplit`` is enabled for
  GFX90A onwards.
- *enable_wgp_mode* defaults to 1 if target feature ``cumode`` is disabled for
  GFX10 onwards.
- *enable_mem_ordered* defaults to 1 for GFX10 onwards.

The *.amd_kernel_code_t* directive must be placed immediately after the
function label and before any instructions.

For a full list of amd_kernel_code_t keys, refer to the AMDGPU ABI document,
comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.

..
_amdgpu-amdhsa-assembler-example-v2: 14670 14671Code Object V2 Example Source Code 14672~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 14673 14674.. warning:: 14675 Code Object V2 is not the default code object version emitted by 14676 this version of LLVM. 14677 14678Here is an example of a minimal assembly source file, defining one HSA kernel: 14679 14680.. code:: 14681 :number-lines: 14682 14683 .hsa_code_object_version 1,0 14684 .hsa_code_object_isa 14685 14686 .hsatext 14687 .globl hello_world 14688 .p2align 8 14689 .amdgpu_hsa_kernel hello_world 14690 14691 hello_world: 14692 14693 .amd_kernel_code_t 14694 enable_sgpr_kernarg_segment_ptr = 1 14695 is_ptr64 = 1 14696 compute_pgm_rsrc1_vgprs = 0 14697 compute_pgm_rsrc1_sgprs = 0 14698 compute_pgm_rsrc2_user_sgpr = 2 14699 compute_pgm_rsrc1_wgp_mode = 0 14700 compute_pgm_rsrc1_mem_ordered = 0 14701 compute_pgm_rsrc1_fwd_progress = 1 14702 .end_amd_kernel_code_t 14703 14704 s_load_dwordx2 s[0:1], s[0:1] 0x0 14705 v_mov_b32 v0, 3.14159 14706 s_waitcnt lgkmcnt(0) 14707 v_mov_b32 v1, s0 14708 v_mov_b32 v2, s1 14709 flat_store_dword v[1:2], v0 14710 s_endpgm 14711 .Lfunc_end0: 14712 .size hello_world, .Lfunc_end0-hello_world 14713 14714.. _amdgpu-amdhsa-assembler-predefined-symbols-v3-onwards: 14715 14716Code Object V3 and Above Predefined Symbols 14717~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 14718 14719The AMDGPU assembler defines and updates some symbols automatically. These 14720symbols do not affect code generation. 14721 14722.amdgcn.gfx_generation_number 14723+++++++++++++++++++++++++++++ 14724 14725Set to the GFX major generation number of the target being assembled for. For 14726example, when assembling for a "GFX9" target this will be set to the integer 14727value "9". The possible GFX major generation numbers are presented in 14728:ref:`amdgpu-processors`. 14729 14730.amdgcn.gfx_generation_minor 14731++++++++++++++++++++++++++++ 14732 14733Set to the GFX minor generation number of the target being assembled for. 
For
example, when assembling for a "GFX810" target this will be set to the integer
value "1". The possible GFX minor generation numbers are presented in
:ref:`amdgpu-processors`.

.amdgcn.gfx_generation_stepping
+++++++++++++++++++++++++++++++

Set to the GFX stepping generation number of the target being assembled for.
For example, when assembling for a "GFX704" target this will be set to the
integer value "4". The possible GFX stepping generation numbers are presented
in :ref:`amdgpu-processors`.

.. _amdgpu-amdhsa-assembler-symbol-next_free_vgpr:

.amdgcn.next_free_vgpr
++++++++++++++++++++++

Set to zero before assembly begins. At each instruction, if the current value
of this symbol is less than or equal to the maximum VGPR number explicitly
referenced within that instruction then the symbol value is updated to equal
that VGPR number plus one.

May be used to set the `.amdhsa_next_free_vgpr` directive in
:ref:`amdhsa-kernel-directives-table`.

May be set at any time, e.g. manually set to zero at the start of each kernel.

.. _amdgpu-amdhsa-assembler-symbol-next_free_sgpr:

.amdgcn.next_free_sgpr
++++++++++++++++++++++

Set to zero before assembly begins. At each instruction, if the current value
of this symbol is less than or equal to the maximum SGPR number explicitly
referenced within that instruction then the symbol value is updated to equal
that SGPR number plus one.

May be used to set the `.amdhsa_next_free_sgpr` directive in
:ref:`amdhsa-kernel-directives-table`.

May be set at any time, e.g. manually set to zero at the start of each kernel.

..
_amdgpu-amdhsa-assembler-directives-v3-onwards: 14777 14778Code Object V3 and Above Directives 14779~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 14780 14781Directives which begin with ``.amdgcn`` are valid for all ``amdgcn`` 14782architecture processors, and are not OS-specific. Directives which begin with 14783``.amdhsa`` are specific to ``amdgcn`` architecture processors when the 14784``amdhsa`` OS is specified. See :ref:`amdgpu-target-triples` and 14785:ref:`amdgpu-processors`. 14786 14787.. _amdgpu-assembler-directive-amdgcn-target: 14788 14789.amdgcn_target <target-triple> "-" <target-id> 14790++++++++++++++++++++++++++++++++++++++++++++++ 14791 14792Optional directive which declares the ``<target-triple>-<target-id>`` supported 14793by the containing assembler source file. Used by the assembler to validate 14794command-line options such as ``-triple``, ``-mcpu``, and 14795``--offload-arch=<target-id>``. A non-canonical target ID is allowed. See 14796:ref:`amdgpu-target-triples` and :ref:`amdgpu-target-id`. 14797 14798.. note:: 14799 14800 The target ID syntax used for code object V2 to V3 for this directive differs 14801 from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`. 14802 14803.amdhsa_kernel <name> 14804+++++++++++++++++++++ 14805 14806Creates a correctly aligned AMDHSA kernel descriptor and a symbol, 14807``<name>.kd``, in the current location of the current section. Only valid when 14808the OS is ``amdhsa``. ``<name>`` must be a symbol that labels the first 14809instruction to execute, and does not need to be previously defined. 14810 14811Marks the beginning of a list of directives used to generate the bytes of a 14812kernel descriptor, as described in :ref:`amdgpu-amdhsa-kernel-descriptor`. 14813Directives which may appear in this list are described in 14814:ref:`amdhsa-kernel-directives-table`. Directives may appear in any order, must 14815be valid for the target being assembled for, and cannot be repeated. 
Directives 14816support the range of values specified by the field they reference in 14817:ref:`amdgpu-amdhsa-kernel-descriptor`. If a directive is not specified, it is 14818assumed to have its default value, unless it is marked as "Required", in which 14819case it is an error to omit the directive. This list of directives is 14820terminated by an ``.end_amdhsa_kernel`` directive. 14821 14822 .. table:: AMDHSA Kernel Assembler Directives 14823 :name: amdhsa-kernel-directives-table 14824 14825 ======================================================== =================== ============ =================== 14826 Directive Default Supported On Description 14827 ======================================================== =================== ============ =================== 14828 ``.amdhsa_group_segment_fixed_size`` 0 GFX6-GFX11 Controls GROUP_SEGMENT_FIXED_SIZE in 14829 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 14830 ``.amdhsa_private_segment_fixed_size`` 0 GFX6-GFX11 Controls PRIVATE_SEGMENT_FIXED_SIZE in 14831 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 14832 ``.amdhsa_kernarg_size`` 0 GFX6-GFX11 Controls KERNARG_SIZE in 14833 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 14834 ``.amdhsa_user_sgpr_count`` 0 GFX6-GFX11 Controls USER_SGPR_COUNT in COMPUTE_PGM_RSRC2 14835 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table` 14836 ``.amdhsa_user_sgpr_private_segment_buffer`` 0 GFX6-GFX10 Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in 14837 (except :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 14838 GFX940) 14839 ``.amdhsa_user_sgpr_dispatch_ptr`` 0 GFX6-GFX11 Controls ENABLE_SGPR_DISPATCH_PTR in 14840 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 14841 ``.amdhsa_user_sgpr_queue_ptr`` 0 GFX6-GFX11 Controls ENABLE_SGPR_QUEUE_PTR in 14842 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 14843 ``.amdhsa_user_sgpr_kernarg_segment_ptr`` 0 GFX6-GFX11 Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in 14844 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 
14845 ``.amdhsa_user_sgpr_dispatch_id`` 0 GFX6-GFX11 Controls ENABLE_SGPR_DISPATCH_ID in 14846 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 14847 ``.amdhsa_user_sgpr_flat_scratch_init`` 0 GFX6-GFX10 Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in 14848 (except :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 14849 GFX940) 14850 ``.amdhsa_user_sgpr_private_segment_size`` 0 GFX6-GFX11 Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in 14851 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 14852 ``.amdhsa_wavefront_size32`` Target GFX10-GFX11 Controls ENABLE_WAVEFRONT_SIZE32 in 14853 Feature :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 14854 Specific 14855 (wavefrontsize64) 14856 ``.amdhsa_uses_dynamic_stack`` 0 GFX6-GFX11 Controls USES_DYNAMIC_STACK in 14857 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 14858 ``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0 GFX6-GFX10 Controls ENABLE_PRIVATE_SEGMENT in 14859 (except :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`. 14860 GFX940) 14861 ``.amdhsa_enable_private_segment`` 0 GFX940, Controls ENABLE_PRIVATE_SEGMENT in 14862 GFX11 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`. 14863 ``.amdhsa_system_sgpr_workgroup_id_x`` 1 GFX6-GFX11 Controls ENABLE_SGPR_WORKGROUP_ID_X in 14864 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`. 14865 ``.amdhsa_system_sgpr_workgroup_id_y`` 0 GFX6-GFX11 Controls ENABLE_SGPR_WORKGROUP_ID_Y in 14866 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`. 14867 ``.amdhsa_system_sgpr_workgroup_id_z`` 0 GFX6-GFX11 Controls ENABLE_SGPR_WORKGROUP_ID_Z in 14868 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`. 14869 ``.amdhsa_system_sgpr_workgroup_info`` 0 GFX6-GFX11 Controls ENABLE_SGPR_WORKGROUP_INFO in 14870 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`. 14871 ``.amdhsa_system_vgpr_workitem_id`` 0 GFX6-GFX11 Controls ENABLE_VGPR_WORKITEM_ID in 14872 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`. 
14873 Possible values are defined in 14874 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`. 14875 ``.amdhsa_next_free_vgpr`` Required GFX6-GFX11 Maximum VGPR number explicitly referenced, plus one. 14876 Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in 14877 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`. 14878 ``.amdhsa_next_free_sgpr`` Required GFX6-GFX11 Maximum SGPR number explicitly referenced, plus one. 14879 Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in 14880 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`. 14881 ``.amdhsa_accum_offset`` Required GFX90A, Offset of a first AccVGPR in the unified register file. 14882 GFX940 Used to calculate ACCUM_OFFSET in 14883 :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`. 14884 ``.amdhsa_reserve_vcc`` 1 GFX6-GFX11 Whether the kernel may use the special VCC SGPR. 14885 Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in 14886 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`. 14887 ``.amdhsa_reserve_flat_scratch`` 1 GFX7-GFX10 Whether the kernel may use flat instructions to access 14888 (except scratch memory. Used to calculate 14889 GFX940) GRANULATED_WAVEFRONT_SGPR_COUNT in 14890 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`. 14891 ``.amdhsa_reserve_xnack_mask`` Target GFX8-GFX10 Whether the kernel may trigger XNACK replay. 14892 Feature Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in 14893 Specific :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`. 14894 (xnack) 14895 ``.amdhsa_float_round_mode_32`` 0 GFX6-GFX11 Controls FLOAT_ROUND_MODE_32 in 14896 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`. 14897 Possible values are defined in 14898 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`. 14899 ``.amdhsa_float_round_mode_16_64`` 0 GFX6-GFX11 Controls FLOAT_ROUND_MODE_16_64 in 14900 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`. 
14901 Possible values are defined in 14902 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`. 14903 ``.amdhsa_float_denorm_mode_32`` 0 GFX6-GFX11 Controls FLOAT_DENORM_MODE_32 in 14904 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`. 14905 Possible values are defined in 14906 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`. 14907 ``.amdhsa_float_denorm_mode_16_64`` 3 GFX6-GFX11 Controls FLOAT_DENORM_MODE_16_64 in 14908 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`. 14909 Possible values are defined in 14910 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`. 14911 ``.amdhsa_dx10_clamp`` 1 GFX6-GFX11 Controls ENABLE_DX10_CLAMP in 14912 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`. 14913 ``.amdhsa_ieee_mode`` 1 GFX6-GFX11 Controls ENABLE_IEEE_MODE in 14914 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`. 14915 ``.amdhsa_fp16_overflow`` 0 GFX9-GFX11 Controls FP16_OVFL in 14916 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`. 14917 ``.amdhsa_tg_split`` Target GFX90A, Controls TG_SPLIT in 14918 Feature GFX940, :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`. 14919 Specific GFX11 14920 (tgsplit) 14921 ``.amdhsa_workgroup_processor_mode`` Target GFX10-GFX11 Controls ENABLE_WGP_MODE in 14922 Feature :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 14923 Specific 14924 (cumode) 14925 ``.amdhsa_memory_ordered`` 1 GFX10-GFX11 Controls MEM_ORDERED in 14926 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`. 14927 ``.amdhsa_forward_progress`` 0 GFX10-GFX11 Controls FWD_PROGRESS in 14928 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`. 14929 ``.amdhsa_shared_vgpr_count`` 0 GFX10-GFX11 Controls SHARED_VGPR_COUNT in 14930 :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-gfx11-table`. 
14931 ``.amdhsa_exception_fp_ieee_invalid_op`` 0 GFX6-GFX11 Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in 14932 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`. 14933 ``.amdhsa_exception_fp_denorm_src`` 0 GFX6-GFX11 Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in 14934 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`. 14935 ``.amdhsa_exception_fp_ieee_div_zero`` 0 GFX6-GFX11 Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in 14936 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`. 14937 ``.amdhsa_exception_fp_ieee_overflow`` 0 GFX6-GFX11 Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in 14938 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`. 14939 ``.amdhsa_exception_fp_ieee_underflow`` 0 GFX6-GFX11 Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in 14940 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`. 14941 ``.amdhsa_exception_fp_ieee_inexact`` 0 GFX6-GFX11 Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in 14942 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`. 14943 ``.amdhsa_exception_int_div_zero`` 0 GFX6-GFX11 Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in 14944 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`. 14945 ======================================================== =================== ============ =================== 14946 14947.amdgpu_metadata 14948++++++++++++++++ 14949 14950Optional directive which declares the contents of the ``NT_AMDGPU_METADATA`` 14951note record (see :ref:`amdgpu-elf-note-records-table-v3-onwards`). 14952 14953The contents must be in the [YAML]_ markup format, with the same structure and 14954semantics described in :ref:`amdgpu-amdhsa-code-object-metadata-v3`, 14955:ref:`amdgpu-amdhsa-code-object-metadata-v4` or 14956:ref:`amdgpu-amdhsa-code-object-metadata-v5`. 14957 14958This directive is terminated by an ``.end_amdgpu_metadata`` directive. 14959 14960.. 
_amdgpu-amdhsa-assembler-example-v3-onwards: 14961 14962Code Object V3 and Above Example Source Code 14963~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 14964 14965Here is an example of a minimal assembly source file, defining one HSA kernel: 14966 14967.. code:: 14968 :number-lines: 14969 14970 .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional 14971 14972 .text 14973 .globl hello_world 14974 .p2align 8 14975 .type hello_world,@function 14976 hello_world: 14977 s_load_dwordx2 s[0:1], s[0:1] 0x0 14978 v_mov_b32 v0, 3.14159 14979 s_waitcnt lgkmcnt(0) 14980 v_mov_b32 v1, s0 14981 v_mov_b32 v2, s1 14982 flat_store_dword v[1:2], v0 14983 s_endpgm 14984 .Lfunc_end0: 14985 .size hello_world, .Lfunc_end0-hello_world 14986 14987 .rodata 14988 .p2align 6 14989 .amdhsa_kernel hello_world 14990 .amdhsa_user_sgpr_kernarg_segment_ptr 1 14991 .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr 14992 .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr 14993 .end_amdhsa_kernel 14994 14995 .amdgpu_metadata 14996 --- 14997 amdhsa.version: 14998 - 1 14999 - 0 15000 amdhsa.kernels: 15001 - .name: hello_world 15002 .symbol: hello_world.kd 15003 .kernarg_segment_size: 48 15004 .group_segment_fixed_size: 0 15005 .private_segment_fixed_size: 0 15006 .kernarg_segment_align: 4 15007 .wavefront_size: 64 15008 .sgpr_count: 2 15009 .vgpr_count: 3 15010 .max_flat_workgroup_size: 256 15011 .args: 15012 - .size: 8 15013 .offset: 0 15014 .value_kind: global_buffer 15015 .address_space: global 15016 .actual_access: write_only 15017 //... 15018 .end_amdgpu_metadata 15019 15020This kernel is equivalent to the following HIP program: 15021 15022.. 
code:: 15023 :number-lines: 15024 15025 __global__ void hello_world(float *p) { 15026 *p = 3.14159f; 15027 } 15028 15029If an assembly source file contains multiple kernels and/or functions, the 15030:ref:`amdgpu-amdhsa-assembler-symbol-next_free_vgpr` and 15031:ref:`amdgpu-amdhsa-assembler-symbol-next_free_sgpr` symbols may be reset using 15032the ``.set <symbol>, <expression>`` directive. For example, in the case of two 15033kernels, where ``function1`` is only called from ``kernel1`` it is sufficient 15034to group the function with the kernel that calls it and reset the symbols 15035between the two connected components: 15036 15037.. code:: 15038 :number-lines: 15039 15040 .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional 15041 15042 // gpr tracking symbols are implicitly set to zero 15043 15044 .text 15045 .globl kern0 15046 .p2align 8 15047 .type kern0,@function 15048 kern0: 15049 // ... 15050 s_endpgm 15051 .Lkern0_end: 15052 .size kern0, .Lkern0_end-kern0 15053 15054 .rodata 15055 .p2align 6 15056 .amdhsa_kernel kern0 15057 // ... 15058 .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr 15059 .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr 15060 .end_amdhsa_kernel 15061 15062 // reset symbols to begin tracking usage in func1 and kern1 15063 .set .amdgcn.next_free_vgpr, 0 15064 .set .amdgcn.next_free_sgpr, 0 15065 15066 .text 15067 .hidden func1 15068 .global func1 15069 .p2align 2 15070 .type func1,@function 15071 func1: 15072 // ... 15073 s_setpc_b64 s[30:31] 15074 .Lfunc1_end: 15075 .size func1, .Lfunc1_end-func1 15076 15077 .globl kern1 15078 .p2align 8 15079 .type kern1,@function 15080 kern1: 15081 // ... 15082 s_getpc_b64 s[4:5] 15083 s_add_u32 s4, s4, func1@rel32@lo+4 15084 s_addc_u32 s5, s5, func1@rel32@lo+4 15085 s_swappc_b64 s[30:31], s[4:5] 15086 // ... 15087 s_endpgm 15088 .Lkern1_end: 15089 .size kern1, .Lkern1_end-kern1 15090 15091 .rodata 15092 .p2align 6 15093 .amdhsa_kernel kern1 15094 // ... 
15095 .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr 15096 .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr 15097 .end_amdhsa_kernel 15098 15099These symbols cannot identify connected components in order to automatically 15100track the usage for each kernel. However, in some cases careful organization of 15101the kernels and functions in the source file means there is minimal additional 15102effort required to accurately calculate GPR usage. 15103 15104Additional Documentation 15105======================== 15106 15107.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__ 15108.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_ 15109.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__ 15110.. [AMD-GCN-GFX900-GFX904-VEGA] `AMD Vega Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__ 15111.. [AMD-GCN-GFX906-VEGA7NM] `AMD Vega 7nm Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/11/Vega_7nm_Shader_ISA_26November2019.pdf>`__ 15112.. [AMD-GCN-GFX908-CDNA1] `AMD Instinct MI100 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/CDNA1_Shader_ISA_14December2020.pdf>`__ 15113.. [AMD-GCN-GFX90A-CDNA2] `AMD Instinct MI200 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/CDNA2_Shader_ISA_4February2022.pdf>`__ 15114.. [AMD-GCN-GFX10-RDNA1] `AMD RDNA 1.0 Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf>`__ 15115.. [AMD-GCN-GFX10-RDNA2] `AMD RDNA 2 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf>`__ 15116.. 
[AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__ 15117.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__ 15118.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__ 15119.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__ 15120.. [AMD-ROCm] `AMD ROCm™ Platform <https://rocmdocs.amd.com/>`__ 15121.. [AMD-ROCm-github] `AMD ROCm™ github <http://github.com/RadeonOpenCompute>`__ 15122.. [AMD-ROCm-Release-Notes] `AMD ROCm Release Notes <https://github.com/RadeonOpenCompute/ROCm>`__ 15123.. [CLANG-ATTR] `Attributes in Clang <https://clang.llvm.org/docs/AttributeReference.html>`__ 15124.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__ 15125.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__ 15126.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__ 15127.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__ 15128.. [MsgPack] `Message Pack <http://www.msgpack.org/>`__ 15129.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__ 15130.. [SEMVER] `Semantic Versioning <https://semver.org/>`__ 15131.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__ 15132