=============================
User Guide for AMDGPU Backend
=============================

.. contents::
   :local:

.. toctree::
   :hidden:

   AMDGPU/AMDGPUAsmGFX7
   AMDGPU/AMDGPUAsmGFX8
   AMDGPU/AMDGPUAsmGFX9
   AMDGPU/AMDGPUAsmGFX900
   AMDGPU/AMDGPUAsmGFX904
   AMDGPU/AMDGPUAsmGFX906
   AMDGPU/AMDGPUAsmGFX908
   AMDGPU/AMDGPUAsmGFX90a
   AMDGPU/AMDGPUAsmGFX940
   AMDGPU/AMDGPUAsmGFX10
   AMDGPU/AMDGPUAsmGFX1011
   AMDGPU/AMDGPUAsmGFX1013
   AMDGPU/AMDGPUAsmGFX1030
   AMDGPUModifierSyntax
   AMDGPUOperandSyntax
   AMDGPUInstructionSyntax
   AMDGPUInstructionNotation
   AMDGPUDwarfExtensionsForHeterogeneousDebugging
   AMDGPUDwarfExtensionAllowLocationDescriptionOnTheDwarfExpressionStack/AMDGPUDwarfExtensionAllowLocationDescriptionOnTheDwarfExpressionStack

Introduction
============

The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
R600 family up until the current GCN families. It lives in the
``llvm/lib/Target/AMDGPU`` directory.

LLVM
====

.. _amdgpu-target-triples:

Target Triples
--------------

Use the Clang option ``-target <Architecture>-<Vendor>-<OS>-<Environment>``
to specify the target triple:

  .. table:: AMDGPU Architectures
     :name: amdgpu-architecture-table

     ============ ==============================================================
     Architecture Description
     ============ ==============================================================
     ``r600``     AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
     ``amdgcn``   AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
     ============ ==============================================================

  .. table:: AMDGPU Vendors
     :name: amdgpu-vendor-table

     ============ ==============================================================
     Vendor       Description
     ============ ==============================================================
     ``amd``      Can be used for all AMD GPU usage.
     ``mesa3d``   Can be used if the OS is ``mesa3d``.
     ============ ==============================================================

  .. table:: AMDGPU Operating Systems
     :name: amdgpu-os

     ============== ============================================================
     OS             Description
     ============== ============================================================
     *<empty>*      Defaults to the *unknown* OS.
     ``amdhsa``     Compute kernels executed on HSA [HSA]_ compatible runtimes
                    such as:

                    - AMD's ROCm™ runtime [AMD-ROCm]_ using the *rocm-amdhsa*
                      loader on Linux. See *AMD ROCm Platform Release Notes*
                      [AMD-ROCm-Release-Notes]_ for supported hardware and
                      software.
                    - AMD's PAL runtime using the *pal-amdhsa* loader on
                      Windows.

     ``amdpal``     Graphic shaders and compute kernels executed on AMD's PAL
                    runtime using the *pal-amdpal* loader on Windows and Linux
                    Pro.
     ``mesa3d``     Graphic shaders and compute kernels executed on AMD's Mesa
                    3D runtime using the *mesa-mesa3d* loader on Linux.
     ============== ============================================================

  .. table:: AMDGPU Environments
     :name: amdgpu-environment-table

     ============ ==============================================================
     Environment  Description
     ============ ==============================================================
     *<empty>*    Default.
     ============ ==============================================================

.. _amdgpu-processors:

Processors
----------

Use the Clang options ``-mcpu=<target-id>`` or ``--offload-arch=<target-id>`` to
specify the AMDGPU processor together with optional target features. See
:ref:`amdgpu-target-id` and :ref:`amdgpu-target-features` for AMD GPU target
specific information.
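For example, the triple and processor can be combined as sketched below. These
invocations are illustrative only: the source file names are placeholders, and
the chosen processors are examples rather than recommendations.

```shell
# Illustrative only: compile for the amdhsa OS on a gfx908 processor.
# kernel.cl is a placeholder source file.
clang -target amdgcn-amd-amdhsa -mcpu=gfx908 -c kernel.cl -o kernel.o

# HIP offload compilation selecting the processor with --offload-arch.
# app.hip is a placeholder source file.
clang -x hip --offload-arch=gfx906 -c app.hip -o app.o
```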
Every processor supports every OS ABI (see :ref:`amdgpu-os`) with the following
exceptions:

* ``amdhsa`` is not supported by the ``r600`` architecture (see
  :ref:`amdgpu-architecture-table`).

  .. table:: AMDGPU Processors
     :name: amdgpu-processor-table

     =========== =============== ============ ===== ================= =============== =============== ======================
     Processor   Alternative     Target       dGPU/ Target            Target          OS Support      Example
                 Processor       Triple       APU   Features          Properties      *(see*          Products
                                 Architecture       Supported                         `amdgpu-os`_
                                                                                      *and
                                                                                      corresponding
                                                                                      runtime release
                                                                                      notes for
                                                                                      current
                                                                                      information and
                                                                                      level of
                                                                                      support)*
     =========== =============== ============ ===== ================= =============== =============== ======================
     **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``r600``                    ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``r630``                    ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``rs880``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``rv670``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``rv710``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``rv730``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``rv770``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``cedar``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``cypress``                 ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``juniper``                 ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``redwood``                 ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``sumo``                    ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``barts``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``caicos``                  ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``cayman``                  ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``turks``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx600``  - ``tahiti``    ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``gfx601``  - ``pitcairn``  ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
                 - ``verde``                                            support
                                                                        generic
                                                                        address
                                                                        space
     ``gfx602``  - ``hainan``    ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
                 - ``oland``                                            support
                                                                        generic
                                                                        address
                                                                        space
     **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx700``  - ``kaveri``    ``amdgcn``   APU                     - Offset flat   - *rocm-amdhsa* - A6-7000
                                                                        scratch       - *pal-amdhsa*  - A6 Pro-7050B
                                                                                      - *pal-amdpal*  - A8-7100
                                                                                                      - A8 Pro-7150B
                                                                                                      - A10-7300
                                                                                                      - A10 Pro-7350B
                                                                                                      - FX-7500
                                                                                                      - A8-7200P
                                                                                                      - A10-7400P
                                                                                                      - FX-7600P
     ``gfx701``  - ``hawaii``    ``amdgcn``   dGPU                    - Offset flat   - *rocm-amdhsa* - FirePro W8100
                                                                        scratch       - *pal-amdhsa*  - FirePro W9100
                                                                                      - *pal-amdpal*  - FirePro S9150
                                                                                                      - FirePro S9170
     ``gfx702``                  ``amdgcn``   dGPU                    - Offset flat   - *rocm-amdhsa* - Radeon R9 290
                                                                        scratch       - *pal-amdhsa*  - Radeon R9 290x
                                                                                      - *pal-amdpal*  - Radeon R390
                                                                                                      - Radeon R390x
     ``gfx703``  - ``kabini``    ``amdgcn``   APU                     - Offset flat   - *pal-amdhsa*  - E1-2100
                 - ``mullins``                                          scratch       - *pal-amdpal*  - E1-2200
                                                                                                      - E1-2500
                                                                                                      - E2-3000
                                                                                                      - E2-3800
                                                                                                      - A4-5000
                                                                                                      - A4-5100
                                                                                                      - A6-5200
                                                                                                      - A4 Pro-3340B
     ``gfx704``  - ``bonaire``   ``amdgcn``   dGPU                    - Offset flat   - *pal-amdhsa*  - Radeon HD 7790
                                                                        scratch       - *pal-amdpal*  - Radeon HD 8770
                                                                                                      - R7 260
                                                                                                      - R7 260X
     ``gfx705``                  ``amdgcn``   APU                     - Offset flat   - *pal-amdhsa*  *TBA*
                                                                        scratch       - *pal-amdpal*
                                                                                                      .. TODO::

                                                                                                         Add product
                                                                                                         names.

     **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx801``  - ``carrizo``   ``amdgcn``   APU   - xnack           - Offset flat   - *rocm-amdhsa* - A6-8500P
                                                                        scratch       - *pal-amdhsa*  - Pro A6-8500B
                                                                                      - *pal-amdpal*  - A8-8600P
                                                                                                      - Pro A8-8600B
                                                                                                      - FX-8800P
                                                                                                      - Pro A12-8800B
                                                                                                      - A10-8700P
                                                                                                      - Pro A10-8700B
                                                                                                      - A10-8780P
                                                                                                      - A10-9600P
                                                                                                      - A10-9630P
                                                                                                      - A12-9700P
                                                                                                      - A12-9730P
                                                                                                      - FX-9800P
                                                                                                      - FX-9830P
                                                                                                      - E2-9010
                                                                                                      - A6-9210
                                                                                                      - A9-9410
     ``gfx802``  - ``iceland``   ``amdgcn``   dGPU                    - Offset flat   - *rocm-amdhsa* - Radeon R9 285
                 - ``tonga``                                            scratch       - *pal-amdhsa*  - Radeon R9 380
                                                                                      - *pal-amdpal*  - Radeon R9 385
     ``gfx803``  - ``fiji``      ``amdgcn``   dGPU                                    - *rocm-amdhsa* - Radeon R9 Nano
                                                                                      - *pal-amdhsa*  - Radeon R9 Fury
                                                                                      - *pal-amdpal*  - Radeon R9 FuryX
                                                                                                      - Radeon Pro Duo
                                                                                                      - FirePro S9300x2
                                                                                                      - Radeon Instinct MI8
     \           - ``polaris10`` ``amdgcn``   dGPU                    - Offset flat   - *rocm-amdhsa* - Radeon RX 470
                                                                        scratch       - *pal-amdhsa*  - Radeon RX 480
                                                                                      - *pal-amdpal*  - Radeon Instinct MI6
     \           - ``polaris11`` ``amdgcn``   dGPU                    - Offset flat   - *rocm-amdhsa* - Radeon RX 460
                                                                        scratch       - *pal-amdhsa*
                                                                                      - *pal-amdpal*
     ``gfx805``  - ``tongapro``  ``amdgcn``   dGPU                    - Offset flat   - *rocm-amdhsa* - FirePro S7150
                                                                        scratch       - *pal-amdhsa*  - FirePro S7100
                                                                                      - *pal-amdpal*  - FirePro W7100
                                                                                                      - Mobile FirePro
                                                                                                        M7170
     ``gfx810``  - ``stoney``    ``amdgcn``   APU   - xnack           - Offset flat   - *rocm-amdhsa* *TBA*
                                                                        scratch       - *pal-amdhsa*
                                                                                      - *pal-amdpal*
                                                                                                      .. TODO::

                                                                                                         Add product
                                                                                                         names.

     **GCN GFX9 (Vega)** [AMD-GCN-GFX900-GFX904-VEGA]_ [AMD-GCN-GFX906-VEGA7NM]_ [AMD-GCN-GFX908-CDNA1]_ [AMD-GCN-GFX90A-CDNA2]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx900``                  ``amdgcn``   dGPU  - xnack           - Absolute      - *rocm-amdhsa* - Radeon Vega
                                                                        flat scratch  - *pal-amdhsa*    Frontier Edition
                                                                                      - *pal-amdpal*  - Radeon RX Vega 56
                                                                                                      - Radeon RX Vega 64
                                                                                                      - Radeon RX Vega 64
                                                                                                        Liquid
                                                                                                      - Radeon Instinct MI25
     ``gfx902``                  ``amdgcn``   APU   - xnack           - Absolute      - *rocm-amdhsa* - Ryzen 3 2200G
                                                                        flat scratch  - *pal-amdhsa*  - Ryzen 5 2400G
                                                                                      - *pal-amdpal*
     ``gfx904``                  ``amdgcn``   dGPU  - xnack                           - *rocm-amdhsa* *TBA*
                                                                                      - *pal-amdhsa*
                                                                                      - *pal-amdpal*
                                                                                                      .. TODO::

                                                                                                         Add product
                                                                                                         names.

     ``gfx906``                  ``amdgcn``   dGPU  - sramecc         - Absolute      - *rocm-amdhsa* - Radeon Instinct MI50
                                                    - xnack             flat scratch  - *pal-amdhsa*  - Radeon Instinct MI60
                                                                                      - *pal-amdpal*  - Radeon VII
                                                                                                      - Radeon Pro VII
     ``gfx908``                  ``amdgcn``   dGPU  - sramecc         - Absolute      - *rocm-amdhsa* - AMD Instinct MI100
                                                    - xnack             flat scratch                    Accelerator
     ``gfx909``                  ``amdgcn``   APU   - xnack           - Absolute      - *pal-amdpal*  *TBA*
                                                                        flat scratch
                                                                                                      .. TODO::

                                                                                                         Add product
                                                                                                         names.

     ``gfx90a``                  ``amdgcn``   dGPU  - sramecc         - Absolute      - *rocm-amdhsa* *TBA*
                                                    - tgsplit           flat scratch
                                                    - xnack           - Packed
                                                                        work-item
                                                                        IDs
                                                                                                      .. TODO::

                                                                                                         Add product
                                                                                                         names.

     ``gfx90c``                  ``amdgcn``   APU   - xnack           - Absolute      - *pal-amdpal*  - Ryzen 7 4700G
                                                                        flat scratch                  - Ryzen 7 4700GE
                                                                                                      - Ryzen 5 4600G
                                                                                                      - Ryzen 5 4600GE
                                                                                                      - Ryzen 3 4300G
                                                                                                      - Ryzen 3 4300GE
                                                                                                      - Ryzen Pro 4000G
                                                                                                      - Ryzen 7 Pro 4700G
                                                                                                      - Ryzen 7 Pro 4750GE
                                                                                                      - Ryzen 5 Pro 4650G
                                                                                                      - Ryzen 5 Pro 4650GE
                                                                                                      - Ryzen 3 Pro 4350G
                                                                                                      - Ryzen 3 Pro 4350GE

     ``gfx940``                  ``amdgcn``   dGPU  - sramecc         - Architected                   *TBA*
                                                    - tgsplit           flat scratch
                                                    - xnack           - Packed
                                                                        work-item
                                                                        IDs
                                                                                                      .. TODO::

                                                                                                         Add product
                                                                                                         names.

     **GCN GFX10.1 (RDNA 1)** [AMD-GCN-GFX10-RDNA1]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx1010``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 5700
                                                    - wavefrontsize64   flat scratch  - *pal-amdhsa*  - Radeon RX 5700 XT
                                                    - xnack                           - *pal-amdpal*  - Radeon Pro 5600 XT
                                                                                                      - Radeon Pro 5600M
     ``gfx1011``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon Pro V520
                                                    - wavefrontsize64   flat scratch  - *pal-amdhsa*
                                                    - xnack                           - *pal-amdpal*
     ``gfx1012``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 5500
                                                    - wavefrontsize64   flat scratch  - *pal-amdhsa*  - Radeon RX 5500 XT
                                                    - xnack                           - *pal-amdpal*
     ``gfx1013``                 ``amdgcn``   APU   - cumode          - Absolute      - *rocm-amdhsa* *TBA*
                                                    - wavefrontsize64   flat scratch  - *pal-amdhsa*
                                                    - xnack                           - *pal-amdpal*
                                                                                                      .. TODO::

                                                                                                         Add product
                                                                                                         names.

     **GCN GFX10.3 (RDNA 2)** [AMD-GCN-GFX10-RDNA2]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx1030``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 6800
                                                    - wavefrontsize64   flat scratch  - *pal-amdhsa*  - Radeon RX 6800 XT
                                                                                      - *pal-amdpal*  - Radeon RX 6900 XT
     ``gfx1031``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 6700 XT
                                                    - wavefrontsize64   flat scratch  - *pal-amdhsa*
                                                                                      - *pal-amdpal*
     ``gfx1032``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* *TBA*
                                                    - wavefrontsize64   flat scratch  - *pal-amdhsa*
                                                                                      - *pal-amdpal*
                                                                                                      .. TODO::

                                                                                                         Add product
                                                                                                         names.

     ``gfx1033``                 ``amdgcn``   APU   - cumode          - Absolute      - *pal-amdpal*  *TBA*
                                                    - wavefrontsize64   flat scratch
                                                                                                      .. TODO::

                                                                                                         Add product
                                                                                                         names.

     ``gfx1034``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *pal-amdpal*  *TBA*
                                                    - wavefrontsize64   flat scratch
                                                                                                      .. TODO::

                                                                                                         Add product
                                                                                                         names.

     ``gfx1035``                 ``amdgcn``   APU   - cumode          - Absolute      - *pal-amdpal*  *TBA*
                                                    - wavefrontsize64   flat scratch
                                                                                                      .. TODO::

                                                                                                         Add product
                                                                                                         names.

     ``gfx1036``                 ``amdgcn``   APU   - cumode          - Absolute      - *pal-amdpal*  *TBA*
                                                    - wavefrontsize64   flat scratch
                                                                                                      .. TODO::

                                                                                                         Add product
                                                                                                         names.

     **GCN GFX11**
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx1100``                 ``amdgcn``   dGPU  - cumode          - Architected   - *pal-amdpal*  *TBA*
                                                    - wavefrontsize64   flat scratch
                                                                      - Packed
                                                                        work-item
                                                                        IDs
                                                                                                      .. TODO::

                                                                                                         Add product
                                                                                                         names.

     ``gfx1101``                 ``amdgcn``   dGPU  - cumode          - Architected                   *TBA*
                                                    - wavefrontsize64   flat scratch
                                                                      - Packed
                                                                        work-item
                                                                        IDs
                                                                                                      .. TODO::

                                                                                                         Add product
                                                                                                         names.

     ``gfx1102``                 ``amdgcn``   dGPU  - cumode          - Architected                   *TBA*
                                                    - wavefrontsize64   flat scratch
                                                                      - Packed
                                                                        work-item
                                                                        IDs
                                                                                                      .. TODO::

                                                                                                         Add product
                                                                                                         names.

     ``gfx1103``                 ``amdgcn``   APU   - cumode          - Architected                   *TBA*
                                                    - wavefrontsize64   flat scratch
                                                                      - Packed
                                                                        work-item
                                                                        IDs
                                                                                                      .. TODO::

                                                                                                         Add product
                                                                                                         names.

     =========== =============== ============ ===== ================= =============== =============== ======================

.. _amdgpu-target-features:

Target Features
---------------

Target features control how code is generated to support certain
processor specific features. Not all target features are supported by
all processors. The runtime must ensure that the features supported by
the device used to execute the code match the features enabled when
generating the code. A mismatch of features may result in incorrect
execution, or a reduction in performance.

The target features supported by each processor are listed in
:ref:`amdgpu-processor-table`.

Target features are controlled by exactly one of the following Clang
options:

``-mcpu=<target-id>`` or ``--offload-arch=<target-id>``

  The ``-mcpu`` and ``--offload-arch`` options can specify the target feature as
  optional components of the target ID. If omitted, the target feature has the
  ``any`` value. See :ref:`amdgpu-target-id`.

``-m[no-]<target-feature>``

  Target features not specified by the target ID are specified using a
  separate option. These target features can have an ``on`` or ``off``
  value. ``on`` is specified by omitting the ``no-`` prefix, and
  ``off`` is specified by including the ``no-`` prefix. The default
  if not specified is ``off``.

For example:

``-mcpu=gfx908:xnack+``
  Enable the ``xnack`` feature.
``-mcpu=gfx908:xnack-``
  Disable the ``xnack`` feature.
``-mcumode``
  Enable the ``cumode`` feature.
``-mno-cumode``
  Disable the ``cumode`` feature.

  .. table:: AMDGPU Target Features
     :name: amdgpu-target-features-table

     =============== ============================ ==================================================
     Target Feature  Clang Option to Control      Description
     Name
     =============== ============================ ==================================================
     cumode          - ``-m[no-]cumode``          Control the wavefront execution mode used
                                                  when generating code for kernels. When disabled
                                                  native WGP wavefront execution mode is used,
                                                  when enabled CU wavefront execution mode is used
                                                  (see :ref:`amdgpu-amdhsa-memory-model`).

     sramecc         - ``-mcpu``                  If specified, generate code that can only be
                     - ``--offload-arch``         loaded and executed in a process that has a
                                                  matching setting for SRAMECC.

                                                  If not specified for code object V2 to V3, generate
                                                  code that can be loaded and executed in a process
                                                  with SRAMECC enabled.

                                                  If not specified for code object V4 or above, generate
                                                  code that can be loaded and executed in a process
                                                  with either setting of SRAMECC.

     tgsplit         ``-m[no-]tgsplit``           Enable/disable generating code that assumes
                                                  work-groups are launched in threadgroup split mode.
                                                  When enabled the waves of a work-group may be
                                                  launched in different CUs.

     wavefrontsize64 - ``-m[no-]wavefrontsize64`` Control the wavefront size used when
                                                  generating code for kernels. When disabled
                                                  native wavefront size 32 is used, when enabled
                                                  wavefront size 64 is used.

     xnack           - ``-mcpu``                  If specified, generate code that can only be
                     - ``--offload-arch``         loaded and executed in a process that has a
                                                  matching setting for XNACK replay.

                                                  If not specified for code object V2 to V3, generate
                                                  code that can be loaded and executed in a process
                                                  with XNACK replay enabled.

                                                  If not specified for code object V4 or above, generate
                                                  code that can be loaded and executed in a process
                                                  with either setting of XNACK replay.

                                                  XNACK replay can be used for demand paging and
                                                  page migration. If enabled in the device, then if
                                                  a page fault occurs the code may execute
                                                  incorrectly unless generated with XNACK replay
                                                  enabled, or generated for code object V4 or above
                                                  without specifying XNACK replay. Executing code
                                                  that was generated with XNACK replay enabled, or
                                                  generated for code object V4 or above without
                                                  specifying XNACK replay, on a device that does
                                                  not have XNACK replay enabled will execute
                                                  correctly but may be less performant than code
                                                  generated for XNACK replay disabled.
     =============== ============================ ==================================================

.. _amdgpu-target-id:

Target ID
---------

AMDGPU supports target IDs. See `Clang Offload Bundler
<https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ for a general
description. The AMDGPU target specific information is:

**processor**
  Is an AMDGPU processor or alternative processor name specified in
  :ref:`amdgpu-processor-table`. The non-canonical form target ID allows both
  the primary processor and alternative processor names. The canonical form
  target ID only allows the primary processor name.

**target-feature**
  Is a target feature name specified in :ref:`amdgpu-target-features-table` that
  is supported by the processor. The target features supported by each processor
  are specified in :ref:`amdgpu-processor-table`. Those that can be specified in
  a target ID are marked as being controlled by ``-mcpu`` and
  ``--offload-arch``. Each target feature must appear at most once in a target
  ID. The non-canonical form target ID allows the target features to be
  specified in any order. The canonical form target ID requires the target
  features to be specified in alphabetic order.
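As a sketch of how the processor and target features combine into a target ID
(the file names are placeholders and the feature settings are only examples):

```shell
# Target ID with explicit feature settings; the features appear in
# alphabetic order, as the canonical form requires. app.hip is a
# placeholder source file.
clang -x hip --offload-arch=gfx90a:sramecc+:xnack- -c app.hip -o app.o

# The same kind of feature control with -mcpu; kernel.cl is a placeholder.
clang -target amdgcn-amd-amdhsa -mcpu=gfx908:xnack+ -c kernel.cl -o kernel.o
```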
.. _amdgpu-target-id-v2-v3:

Code Object V2 to V3 Target ID
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The target ID syntax for code object V2 to V3 is the same as defined in `Clang
Offload Bundler <https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ except
when used in the :ref:`amdgpu-assembler-directive-amdgcn-target` assembler
directive and the bundle entry ID. In those cases it has the following BNF
syntax:

.. code::

  <target-id> ::== <processor> ( "+" <target-feature> )*

Where a target feature is omitted if *Off* and present if *On* or *Any*.

.. note::

  Code object V2 to V3 cannot represent *Any* and treats it the same as
  *On*.

.. _amdgpu-embedding-bundled-objects:

Embedding Bundled Code Objects
------------------------------

AMDGPU supports the HIP and OpenMP languages that perform code object embedding
as described in `Clang Offload Bundler
<https://clang.llvm.org/docs/ClangOffloadBundler.html>`_.

.. note::

  The target ID syntax used for code object V2 to V3 for a bundle entry ID
  differs from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.

.. _amdgpu-address-spaces:

Address Spaces
--------------

The AMDGPU architecture supports a number of memory address spaces. The address
space names use the OpenCL standard names, with some additions.

The AMDGPU address spaces correspond to target architecture specific LLVM
address space numbers used in LLVM IR.

The AMDGPU address spaces are described in
:ref:`amdgpu-address-spaces-table`. Only 64-bit process address spaces are
supported for the ``amdgcn`` target.

  .. table:: AMDGPU Address Spaces
     :name: amdgpu-address-spaces-table

     ================================= =============== =========== ================ ======= ============================
     ..                                                                             64-Bit Process Address Space
     --------------------------------- --------------- ----------- ---------------- ------------------------------------
     Address Space Name                LLVM IR Address HSA Segment Hardware         Address NULL Value
                                       Space Number    Name        Name             Size
     ================================= =============== =========== ================ ======= ============================
     Generic                           0               flat        flat             64      0x0000000000000000
     Global                            1               global      global           64      0x0000000000000000
     Region                            2               N/A         GDS              32      *not implemented for AMDHSA*
     Local                             3               group       LDS              32      0xFFFFFFFF
     Constant                          4               constant    *same as global* 64      0x0000000000000000
     Private                           5               private     scratch          32      0xFFFFFFFF
     Constant 32-bit                   6               *TODO*                               0x00000000
     Buffer Fat Pointer (experimental) 7               *TODO*
     ================================= =============== =========== ================ ======= ============================

**Generic**
  The generic address space is supported unless the *Target Properties* column
  of :ref:`amdgpu-processor-table` specifies *Does not support generic address
  space*.

  The generic address space uses the hardware flat address support for two fixed
  ranges of virtual addresses (the private and local apertures), that are
  outside the range of addressable global memory, to map from a flat address to
  a private or local address. This uses FLAT instructions that can take a flat
  address and access global, private (scratch), and group (LDS) memory depending
  on whether the address is within one of the aperture ranges.

  Flat access to scratch requires hardware aperture setup and setup in the
  kernel prologue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat
  access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register
  setup (see :ref:`amdgpu-amdhsa-kernel-prolog-m0`).
  To convert between a private or group address space address (termed a segment
  address) and a flat address, the base address of the corresponding aperture
  can be used. For GFX7-GFX8 these are available in the
  :ref:`amdgpu-amdhsa-hsa-aql-queue`, the address of which can be obtained with
  the Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
  For GFX9-GFX11 the aperture base addresses are directly available as inline
  constant registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``.
  In 64-bit address mode the aperture sizes are 2^32 bytes and the base is
  aligned to 2^32 which makes it easier to convert from flat to segment or
  segment to flat.

  A global address space address has the same value when used as a flat address
  so no conversion is needed.

**Global and Constant**
  The global and constant address spaces both use global virtual addresses,
  which are the same virtual address space used by the CPU. However, some
  virtual addresses may only be accessible to the CPU, some only accessible
  by the GPU, and some by both.

  Using the constant address space indicates that the data will not change
  during the execution of the kernel. This allows scalar read instructions to
  be used. As the constant address space can only be modified on the host
  side, a generic pointer loaded from the constant address space can safely be
  assumed to be a global pointer since only the device global memory is visible
  and managed on the host side. The vector and scalar L1 caches are invalidated
  of volatile data before each kernel dispatch execution to allow constant
  memory to change values between kernel dispatches.

**Region**
  The region address space uses the hardware Global Data Store (GDS). All
  wavefronts executing on the same device will access the same memory for any
  given region address. However, the same region address accessed by wavefronts
  executing on different devices will access different memory. It is higher
  performance than global memory. It is allocated by the runtime. The data
  store (DS) instructions can be used to access it.

**Local**
  The local address space uses the hardware Local Data Store (LDS) which is
  automatically allocated when the hardware creates the wavefronts of a
  work-group, and freed when all the wavefronts of a work-group have
  terminated. All wavefronts belonging to the same work-group will access the
  same memory for any given local address. However, the same local address
  accessed by wavefronts belonging to different work-groups will access
  different memory. It is higher performance than global memory. The data store
  (DS) instructions can be used to access it.

**Private**
  The private address space uses the hardware scratch memory support which
  automatically allocates memory when it creates a wavefront and frees it when
  a wavefront terminates. The memory accessed by a lane of a wavefront for any
  given private address will be different from the memory accessed by another
  lane of the same or different wavefront for the same private address.

  If a kernel dispatch uses scratch, then the hardware allocates memory from a
  pool of backing memory allocated by the runtime for each wavefront. The lanes
  of the wavefront access this using dword (4 byte) interleaving. The mapping
  used from private address to backing memory address is:

    ``wavefront-scratch-base +
    ((private-address / 4) * wavefront-size * 4) +
    (wavefront-lane-id * 4) + (private-address % 4)``

  If each lane of a wavefront accesses the same private address, the
  interleaving results in adjacent dwords being accessed and hence requires
  fewer cache lines to be fetched.
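The mapping can be checked with a quick calculation. The numeric values below
are invented for illustration; only the formula itself comes from the mapping
above. Integer division in the shell matches the dword interleaving described.

```shell
# Invented example values; the expression is the private-to-backing mapping above.
wavefront_scratch_base=0
wavefront_size=64        # lanes per wavefront
private_address=8        # byte offset within the private segment
wavefront_lane_id=2

backing=$(( wavefront_scratch_base + (private_address / 4) * wavefront_size * 4 + wavefront_lane_id * 4 + private_address % 4 ))
echo "$backing"          # prints 520: dword 2 of the interleaved backing, lane 2
```

With a wavefront size of 64, each dword of private memory occupies a
256-byte stripe of backing memory, so private address 8 for lane 2 lands at
2 * 256 + 2 * 4 = 520.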
762 763 There are different ways that the wavefront scratch base address is 764 determined by a wavefront (see 765 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 766 767 Scratch memory can be accessed in an interleaved manner using buffer 768 instructions with the scratch buffer descriptor and per wavefront scratch 769 offset, by the scratch instructions, or by flat instructions. Multi-dword 770 access is not supported except by flat and scratch instructions in 771 GFX9-GFX11. 772 773**Constant 32-bit** 774 *TODO* 775 776**Buffer Fat Pointer** 777 The buffer fat pointer is an experimental address space that is currently 778 unsupported in the backend. It exposes a non-integral pointer that is in 779 the future intended to support the modelling of 128-bit buffer descriptors 780 plus a 32-bit offset into the buffer (in total encapsulating a 160-bit 781 *pointer*), allowing normal LLVM load/store/atomic operations to be used to 782 model the buffer descriptors used heavily in graphics workloads targeting 783 the backend. 784 785.. _amdgpu-memory-scopes: 786 787Memory Scopes 788------------- 789 790This section provides LLVM memory synchronization scopes supported by the AMDGPU 791backend memory model when the target triple OS is ``amdhsa`` (see 792:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`). 793 794The memory model supported is based on the HSA memory model [HSA]_ which is 795based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before 796relation is transitive over the synchronizes-with relation independent of scope 797and synchronizes-with allows the memory scope instances to be inclusive (see 798table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`). 799 800This is different to the OpenCL [OpenCL]_ memory model which does not have scope 801inclusion and requires the memory scopes to exactly match. However, this 802is conservatively correct for OpenCL. 803 804 .. 
table:: AMDHSA LLVM Sync Scopes 805 :name: amdgpu-amdhsa-llvm-sync-scopes-table 806 807 ======================= =================================================== 808 LLVM Sync Scope Description 809 ======================= =================================================== 810 *none* The default: ``system``. 811 812 Synchronizes with, and participates in modification 813 and seq_cst total orderings with, other operations 814 (except image operations) for all address spaces 815 (except private, or generic that accesses private) 816 provided the other operation's sync scope is: 817 818 - ``system``. 819 - ``agent`` and executed by a thread on the same 820 agent. 821 - ``workgroup`` and executed by a thread in the 822 same work-group. 823 - ``wavefront`` and executed by a thread in the 824 same wavefront. 825 826 ``agent`` Synchronizes with, and participates in modification 827 and seq_cst total orderings with, other operations 828 (except image operations) for all address spaces 829 (except private, or generic that accesses private) 830 provided the other operation's sync scope is: 831 832 - ``system`` or ``agent`` and executed by a thread 833 on the same agent. 834 - ``workgroup`` and executed by a thread in the 835 same work-group. 836 - ``wavefront`` and executed by a thread in the 837 same wavefront. 838 839 ``workgroup`` Synchronizes with, and participates in modification 840 and seq_cst total orderings with, other operations 841 (except image operations) for all address spaces 842 (except private, or generic that accesses private) 843 provided the other operation's sync scope is: 844 845 - ``system``, ``agent`` or ``workgroup`` and 846 executed by a thread in the same work-group. 847 - ``wavefront`` and executed by a thread in the 848 same wavefront. 
     ``wavefront``           Synchronizes with, and participates in modification
                             and seq_cst total orderings with, other operations
                             (except image operations) for all address spaces
                             (except private, or generic that accesses private)
                             provided the other operation's sync scope is:

                             - ``system``, ``agent``, ``workgroup`` or
                               ``wavefront`` and executed by a thread in the
                               same wavefront.

     ``singlethread``        Only synchronizes with, and participates in
                             modification and seq_cst total orderings with,
                             other operations (except image operations) running
                             in the same thread for all address spaces (for
                             example, in signal handlers).

     ``one-as``              Same as ``system`` but only synchronizes with other
                             operations within the same address space.

     ``agent-one-as``        Same as ``agent`` but only synchronizes with other
                             operations within the same address space.

     ``workgroup-one-as``    Same as ``workgroup`` but only synchronizes with
                             other operations within the same address space.

     ``wavefront-one-as``    Same as ``wavefront`` but only synchronizes with
                             other operations within the same address space.

     ``singlethread-one-as`` Same as ``singlethread`` but only synchronizes with
                             other operations within the same address space.
     ======================= ===================================================

LLVM IR Intrinsics
------------------

The AMDGPU backend implements the following LLVM IR intrinsics.

*This section is WIP.*

.. TODO::

   List AMDGPU intrinsics.

LLVM IR Attributes
------------------

The AMDGPU backend supports the following LLVM IR attributes.

  .. table:: AMDGPU LLVM IR Attributes
     :name: amdgpu-llvm-ir-attributes-table

     ======================================= ==========================================================
     LLVM Attribute                          Description
     ======================================= ==========================================================
     "amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that
                                             will be used when the kernel is dispatched. Generated by
                                             the ``amdgpu_flat_work_group_size`` CLANG attribute
                                             [CLANG-ATTR]_. The implied default value is 1,1024.

     "amdgpu-implicitarg-num-bytes"="n"      Number of kernel argument bytes to add to the kernel
                                             argument block size for the implicit arguments. This
                                             varies by OS and language (for OpenCL see
                                             :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).

     "amdgpu-num-sgpr"="n"                   Specifies the number of SGPRs to use. Generated by
                                             the ``amdgpu_num_sgpr`` CLANG attribute [CLANG-ATTR]_.

     "amdgpu-num-vgpr"="n"                   Specifies the number of VGPRs to use. Generated by the
                                             ``amdgpu_num_vgpr`` CLANG attribute [CLANG-ATTR]_.

     "amdgpu-waves-per-eu"="m,n"             Specify the minimum and maximum number of waves per
                                             execution unit. Generated by the ``amdgpu_waves_per_eu``
                                             CLANG attribute [CLANG-ATTR]_. This is an optimization
                                             hint, and the backend may not be able to satisfy the
                                             request. If the specified range is incompatible with the
                                             function's "amdgpu-flat-work-group-size" value, the
                                             occupancy bounds implied by the work-group size take
                                             precedence.

     "amdgpu-ieee" true/false.               Specify whether the function expects the IEEE field of the
                                             mode register to be set on entry. Overrides the default
                                             for the calling convention.

     "amdgpu-dx10-clamp" true/false.         Specify whether the function expects the DX10_CLAMP field
                                             of the mode register to be set on entry. Overrides the
                                             default for the calling convention.
     "amdgpu-no-workitem-id-x"               Indicates the function does not depend on the value of the
                                             llvm.amdgcn.workitem.id.x intrinsic. If a function is
                                             marked with this attribute, or reached through a call site
                                             marked with this attribute, the value returned by the
                                             intrinsic is undefined. The backend can generally infer
                                             this during code generation, so typically there is no
                                             benefit to frontends marking functions with this.

     "amdgpu-no-workitem-id-y"               The same as amdgpu-no-workitem-id-x, except for the
                                             llvm.amdgcn.workitem.id.y intrinsic.

     "amdgpu-no-workitem-id-z"               The same as amdgpu-no-workitem-id-x, except for the
                                             llvm.amdgcn.workitem.id.z intrinsic.

     "amdgpu-no-workgroup-id-x"              The same as amdgpu-no-workitem-id-x, except for the
                                             llvm.amdgcn.workgroup.id.x intrinsic.

     "amdgpu-no-workgroup-id-y"              The same as amdgpu-no-workitem-id-x, except for the
                                             llvm.amdgcn.workgroup.id.y intrinsic.

     "amdgpu-no-workgroup-id-z"              The same as amdgpu-no-workitem-id-x, except for the
                                             llvm.amdgcn.workgroup.id.z intrinsic.

     "amdgpu-no-dispatch-ptr"                The same as amdgpu-no-workitem-id-x, except for the
                                             llvm.amdgcn.dispatch.ptr intrinsic.

     "amdgpu-no-implicitarg-ptr"             The same as amdgpu-no-workitem-id-x, except for the
                                             llvm.amdgcn.implicitarg.ptr intrinsic.

     "amdgpu-no-dispatch-id"                 The same as amdgpu-no-workitem-id-x, except for the
                                             llvm.amdgcn.dispatch.id intrinsic.

     "amdgpu-no-queue-ptr"                   Similar to amdgpu-no-workitem-id-x, except for the
                                             llvm.amdgcn.queue.ptr intrinsic. Note that unlike the
                                             other ABI hint attributes, the queue pointer may be
                                             required in situations where the intrinsic call does not
                                             directly appear in the program. Some subtargets require
                                             the queue pointer to handle some addrspacecasts, as well
                                             as the llvm.amdgcn.is.shared, llvm.amdgcn.is.private,
                                             llvm.trap, and llvm.debugtrap intrinsics.
     "amdgpu-no-hostcall-ptr"                Similar to amdgpu-no-implicitarg-ptr, except specific to
                                             the implicit kernel argument that holds the pointer to the
                                             hostcall buffer. If this attribute is absent, then the
                                             amdgpu-no-implicitarg-ptr attribute is also removed.

     "amdgpu-no-heap-ptr"                    Similar to amdgpu-no-implicitarg-ptr, except specific to
                                             the implicit kernel argument that holds the pointer to an
                                             initialized memory buffer that conforms to the
                                             requirements of the malloc/free device library V1 version
                                             implementation. If this attribute is absent, then the
                                             amdgpu-no-implicitarg-ptr attribute is also removed.

     "amdgpu-no-multigrid-sync-arg"          Similar to amdgpu-no-implicitarg-ptr, except specific to
                                             the implicit kernel argument that holds the multigrid
                                             synchronization pointer. If this attribute is absent, then
                                             the amdgpu-no-implicitarg-ptr attribute is also removed.
     ======================================= ==========================================================

.. _amdgpu-elf-code-object:

ELF Code Object
===============

The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
can be linked by ``lld`` to produce a standard ELF shared code object which can
be loaded and executed on an AMDGPU target.

.. _amdgpu-elf-header:

Header
------

The AMDGPU backend uses the following ELF header:

  .. table:: AMDGPU ELF Header
     :name: amdgpu-elf-header-table

     ========================== ===============================
     Field                      Value
     ========================== ===============================
     ``e_ident[EI_CLASS]``      ``ELFCLASS64``
     ``e_ident[EI_DATA]``       ``ELFDATA2LSB``
     ``e_ident[EI_OSABI]``      - ``ELFOSABI_NONE``
                                - ``ELFOSABI_AMDGPU_HSA``
                                - ``ELFOSABI_AMDGPU_PAL``
                                - ``ELFOSABI_AMDGPU_MESA3D``
     ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA_V2``
                                - ``ELFABIVERSION_AMDGPU_HSA_V3``
                                - ``ELFABIVERSION_AMDGPU_HSA_V4``
                                - ``ELFABIVERSION_AMDGPU_HSA_V5``
                                - ``ELFABIVERSION_AMDGPU_PAL``
                                - ``ELFABIVERSION_AMDGPU_MESA3D``
     ``e_type``                 - ``ET_REL``
                                - ``ET_DYN``
     ``e_machine``              ``EM_AMDGPU``
     ``e_entry``                0
     ``e_flags``                See :ref:`amdgpu-elf-header-e_flags-v2-table`,
                                :ref:`amdgpu-elf-header-e_flags-table-v3`,
                                and :ref:`amdgpu-elf-header-e_flags-table-v4-onwards`
     ========================== ===============================

..

  .. table:: AMDGPU ELF Header Enumeration Values
     :name: amdgpu-elf-header-enumeration-values-table

     =============================== =====
     Name                            Value
     =============================== =====
     ``EM_AMDGPU``                   224
     ``ELFOSABI_NONE``               0
     ``ELFOSABI_AMDGPU_HSA``         64
     ``ELFOSABI_AMDGPU_PAL``         65
     ``ELFOSABI_AMDGPU_MESA3D``      66
     ``ELFABIVERSION_AMDGPU_HSA_V2`` 0
     ``ELFABIVERSION_AMDGPU_HSA_V3`` 1
     ``ELFABIVERSION_AMDGPU_HSA_V4`` 2
     ``ELFABIVERSION_AMDGPU_HSA_V5`` 3
     ``ELFABIVERSION_AMDGPU_PAL``    0
     ``ELFABIVERSION_AMDGPU_MESA3D`` 0
     =============================== =====

``e_ident[EI_CLASS]``
  The ELF class is:

  * ``ELFCLASS32`` for the ``r600`` architecture.

  * ``ELFCLASS64`` for the ``amdgcn`` architecture, which only supports 64-bit
    process address space applications.
``e_ident[EI_DATA]``
  All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.

``e_ident[EI_OSABI]``
  One of the following AMDGPU target architecture specific OS ABIs
  (see :ref:`amdgpu-os`):

  * ``ELFOSABI_NONE`` for the *unknown* OS.

  * ``ELFOSABI_AMDGPU_HSA`` for the ``amdhsa`` OS.

  * ``ELFOSABI_AMDGPU_PAL`` for the ``amdpal`` OS.

  * ``ELFOSABI_AMDGPU_MESA3D`` for the ``mesa3d`` OS.

``e_ident[EI_ABIVERSION]``
  The ABI version of the AMDGPU target architecture specific OS ABI to which
  the code object conforms:

  * ``ELFABIVERSION_AMDGPU_HSA_V2`` is used to specify the version of AMD HSA
    runtime ABI for code object V2. Specify using the Clang option
    ``-mcode-object-version=2``.

  * ``ELFABIVERSION_AMDGPU_HSA_V3`` is used to specify the version of AMD HSA
    runtime ABI for code object V3. Specify using the Clang option
    ``-mcode-object-version=3``.

  * ``ELFABIVERSION_AMDGPU_HSA_V4`` is used to specify the version of AMD HSA
    runtime ABI for code object V4. Specify using the Clang option
    ``-mcode-object-version=4``. This is the default code object version if not
    specified.

  * ``ELFABIVERSION_AMDGPU_HSA_V5`` is used to specify the version of AMD HSA
    runtime ABI for code object V5. Specify using the Clang option
    ``-mcode-object-version=5``.

  * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
    runtime ABI.

  * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
    3D runtime ABI.

``e_type``
  Can be one of the following values:

  ``ET_REL``
    The type produced by the AMDGPU backend compiler, as it is a relocatable
    code object.

  ``ET_DYN``
    The type produced by the linker, as it is a shared code object.

  The AMD HSA runtime loader requires an ``ET_DYN`` code object.
``e_machine``
  The value ``EM_AMDGPU`` is used for the machine for all processors supported
  by the ``r600`` and ``amdgcn`` architectures (see
  :ref:`amdgpu-processor-table`). The specific processor is specified in the
  ``NT_AMD_HSA_ISA_VERSION`` note record for code object V2 (see
  :ref:`amdgpu-note-records-v2`) and in the ``EF_AMDGPU_MACH`` bit field of the
  ``e_flags`` for code object V3 and above (see
  :ref:`amdgpu-elf-header-e_flags-table-v3` and
  :ref:`amdgpu-elf-header-e_flags-table-v4-onwards`).

``e_entry``
  The entry point is 0 as the entry points for individual kernels must be
  selected in order to invoke them through AQL packets.

``e_flags``
  The AMDGPU backend uses the following ELF header flags:

  .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V2
     :name: amdgpu-elf-header-e_flags-v2-table

     ===================================== ===== =============================
     Name                                  Value Description
     ===================================== ===== =============================
     ``EF_AMDGPU_FEATURE_XNACK_V2``        0x01  Indicates if the ``xnack``
                                                 target feature is enabled for
                                                 all code contained in the
                                                 code object. If the processor
                                                 does not support the
                                                 ``xnack`` target feature then
                                                 it must be 0. See
                                                 :ref:`amdgpu-target-features`.
     ``EF_AMDGPU_FEATURE_TRAP_HANDLER_V2`` 0x02  Indicates if the trap handler
                                                 is enabled for all code
                                                 contained in the code object.
                                                 If the processor does not
                                                 support a trap handler then
                                                 it must be 0. See
                                                 :ref:`amdgpu-target-features`.
     ===================================== ===== =============================

  .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V3
     :name: amdgpu-elf-header-e_flags-table-v3

     ================================= ===== =============================
     Name                              Value Description
     ================================= ===== =============================
     ``EF_AMDGPU_MACH``                0x0ff AMDGPU processor selection
                                             mask for
                                             ``EF_AMDGPU_MACH_xxx`` values
                                             defined in
                                             :ref:`amdgpu-ef-amdgpu-mach-table`.
     ``EF_AMDGPU_FEATURE_XNACK_V3``    0x100 Indicates if the ``xnack``
                                             target feature is enabled for
                                             all code contained in the
                                             code object. If the processor
                                             does not support the ``xnack``
                                             target feature then it must
                                             be 0. See
                                             :ref:`amdgpu-target-features`.
     ``EF_AMDGPU_FEATURE_SRAMECC_V3``  0x200 Indicates if the ``sramecc``
                                             target feature is enabled for
                                             all code contained in the
                                             code object. If the processor
                                             does not support the
                                             ``sramecc`` target feature
                                             then it must be 0. See
                                             :ref:`amdgpu-target-features`.
     ================================= ===== =============================

  .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V4 and After
     :name: amdgpu-elf-header-e_flags-table-v4-onwards

     ============================================ ===== ===================================
     Name                                         Value Description
     ============================================ ===== ===================================
     ``EF_AMDGPU_MACH``                           0x0ff AMDGPU processor selection
                                                        mask for ``EF_AMDGPU_MACH_xxx``
                                                        values defined in
                                                        :ref:`amdgpu-ef-amdgpu-mach-table`.
     ``EF_AMDGPU_FEATURE_XNACK_V4``               0x300 XNACK selection mask for
                                                        ``EF_AMDGPU_FEATURE_XNACK_*_V4``
                                                        values.
     ``EF_AMDGPU_FEATURE_XNACK_UNSUPPORTED_V4``   0x000 XNACK unsupported.
     ``EF_AMDGPU_FEATURE_XNACK_ANY_V4``           0x100 XNACK can have any value.
     ``EF_AMDGPU_FEATURE_XNACK_OFF_V4``           0x200 XNACK disabled.
     ``EF_AMDGPU_FEATURE_XNACK_ON_V4``            0x300 XNACK enabled.
     ``EF_AMDGPU_FEATURE_SRAMECC_V4``             0xc00 SRAMECC selection mask for
                                                        ``EF_AMDGPU_FEATURE_SRAMECC_*_V4``
                                                        values.
     ``EF_AMDGPU_FEATURE_SRAMECC_UNSUPPORTED_V4`` 0x000 SRAMECC unsupported.
     ``EF_AMDGPU_FEATURE_SRAMECC_ANY_V4``         0x400 SRAMECC can have any value.
     ``EF_AMDGPU_FEATURE_SRAMECC_OFF_V4``         0x800 SRAMECC disabled.
     ``EF_AMDGPU_FEATURE_SRAMECC_ON_V4``          0xc00 SRAMECC enabled.
     ============================================ ===== ===================================

  .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
     :name: amdgpu-ef-amdgpu-mach-table

     ==================================== ========== =============================
     Name                                 Value      Description (see
                                                     :ref:`amdgpu-processor-table`)
     ==================================== ========== =============================
     ``EF_AMDGPU_MACH_NONE``              0x000      *not specified*
     ``EF_AMDGPU_MACH_R600_R600``         0x001      ``r600``
     ``EF_AMDGPU_MACH_R600_R630``         0x002      ``r630``
     ``EF_AMDGPU_MACH_R600_RS880``        0x003      ``rs880``
     ``EF_AMDGPU_MACH_R600_RV670``        0x004      ``rv670``
     ``EF_AMDGPU_MACH_R600_RV710``        0x005      ``rv710``
     ``EF_AMDGPU_MACH_R600_RV730``        0x006      ``rv730``
     ``EF_AMDGPU_MACH_R600_RV770``        0x007      ``rv770``
     ``EF_AMDGPU_MACH_R600_CEDAR``        0x008      ``cedar``
     ``EF_AMDGPU_MACH_R600_CYPRESS``      0x009      ``cypress``
     ``EF_AMDGPU_MACH_R600_JUNIPER``      0x00a      ``juniper``
     ``EF_AMDGPU_MACH_R600_REDWOOD``      0x00b      ``redwood``
     ``EF_AMDGPU_MACH_R600_SUMO``         0x00c      ``sumo``
     ``EF_AMDGPU_MACH_R600_BARTS``        0x00d      ``barts``
     ``EF_AMDGPU_MACH_R600_CAICOS``       0x00e      ``caicos``
     ``EF_AMDGPU_MACH_R600_CAYMAN``       0x00f      ``cayman``
     ``EF_AMDGPU_MACH_R600_TURKS``        0x010      ``turks``
     *reserved*                           0x011 -    Reserved for ``r600``
                                          0x01f      architecture processors.
     ``EF_AMDGPU_MACH_AMDGCN_GFX600``     0x020      ``gfx600``
     ``EF_AMDGPU_MACH_AMDGCN_GFX601``     0x021      ``gfx601``
     ``EF_AMDGPU_MACH_AMDGCN_GFX700``     0x022      ``gfx700``
     ``EF_AMDGPU_MACH_AMDGCN_GFX701``     0x023      ``gfx701``
     ``EF_AMDGPU_MACH_AMDGCN_GFX702``     0x024      ``gfx702``
     ``EF_AMDGPU_MACH_AMDGCN_GFX703``     0x025      ``gfx703``
     ``EF_AMDGPU_MACH_AMDGCN_GFX704``     0x026      ``gfx704``
     *reserved*                           0x027      Reserved.
     ``EF_AMDGPU_MACH_AMDGCN_GFX801``     0x028      ``gfx801``
     ``EF_AMDGPU_MACH_AMDGCN_GFX802``     0x029      ``gfx802``
     ``EF_AMDGPU_MACH_AMDGCN_GFX803``     0x02a      ``gfx803``
     ``EF_AMDGPU_MACH_AMDGCN_GFX810``     0x02b      ``gfx810``
     ``EF_AMDGPU_MACH_AMDGCN_GFX900``     0x02c      ``gfx900``
     ``EF_AMDGPU_MACH_AMDGCN_GFX902``     0x02d      ``gfx902``
     ``EF_AMDGPU_MACH_AMDGCN_GFX904``     0x02e      ``gfx904``
     ``EF_AMDGPU_MACH_AMDGCN_GFX906``     0x02f      ``gfx906``
     ``EF_AMDGPU_MACH_AMDGCN_GFX908``     0x030      ``gfx908``
     ``EF_AMDGPU_MACH_AMDGCN_GFX909``     0x031      ``gfx909``
     ``EF_AMDGPU_MACH_AMDGCN_GFX90C``     0x032      ``gfx90c``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1010``    0x033      ``gfx1010``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1011``    0x034      ``gfx1011``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1012``    0x035      ``gfx1012``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1030``    0x036      ``gfx1030``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1031``    0x037      ``gfx1031``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1032``    0x038      ``gfx1032``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1033``    0x039      ``gfx1033``
     ``EF_AMDGPU_MACH_AMDGCN_GFX602``     0x03a      ``gfx602``
     ``EF_AMDGPU_MACH_AMDGCN_GFX705``     0x03b      ``gfx705``
     ``EF_AMDGPU_MACH_AMDGCN_GFX805``     0x03c      ``gfx805``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1035``    0x03d      ``gfx1035``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1034``    0x03e      ``gfx1034``
     ``EF_AMDGPU_MACH_AMDGCN_GFX90A``     0x03f      ``gfx90a``
     ``EF_AMDGPU_MACH_AMDGCN_GFX940``     0x040      ``gfx940``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1100``    0x041      ``gfx1100``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1013``    0x042      ``gfx1013``
     *reserved*                           0x043      Reserved.
     ``EF_AMDGPU_MACH_AMDGCN_GFX1103``    0x044      ``gfx1103``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1036``    0x045      ``gfx1036``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1101``    0x046      ``gfx1101``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1102``    0x047      ``gfx1102``
     ==================================== ========== =============================

Sections
--------

An AMDGPU target ELF code object has the standard ELF sections which include:

  .. table:: AMDGPU ELF Sections
     :name: amdgpu-elf-sections-table

     ================== ================ =================================
     Name               Type             Attributes
     ================== ================ =================================
     ``.bss``           ``SHT_NOBITS``   ``SHF_ALLOC`` + ``SHF_WRITE``
     ``.data``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
     ``.debug_``\ *\**  ``SHT_PROGBITS`` *none*
     ``.dynamic``       ``SHT_DYNAMIC``  ``SHF_ALLOC``
     ``.dynstr``        ``SHT_PROGBITS`` ``SHF_ALLOC``
     ``.dynsym``        ``SHT_PROGBITS`` ``SHF_ALLOC``
     ``.got``           ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
     ``.hash``          ``SHT_HASH``     ``SHF_ALLOC``
     ``.note``          ``SHT_NOTE``     *none*
     ``.rela``\ *name*  ``SHT_RELA``     *none*
     ``.rela.dyn``      ``SHT_RELA``     *none*
     ``.rodata``        ``SHT_PROGBITS`` ``SHF_ALLOC``
     ``.shstrtab``      ``SHT_STRTAB``   *none*
     ``.strtab``        ``SHT_STRTAB``   *none*
     ``.symtab``        ``SHT_SYMTAB``   *none*
     ``.text``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
     ================== ================ =================================

These sections have their standard meanings (see [ELF]_) and are only generated
if needed.

``.debug``\ *\**
  The standard DWARF sections. See :ref:`amdgpu-dwarf-debug-information` for
  information on the DWARF produced by the AMDGPU backend.

``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
  The standard sections used by a dynamic loader.
``.note``
  See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
  backend.

``.rela``\ *name*, ``.rela.dyn``
  For relocatable code objects, *name* is the name of the section to which the
  relocation records apply. For example, ``.rela.text`` is the section name for
  relocation records associated with the ``.text`` section.

  For linked shared code objects, ``.rela.dyn`` contains all the relocation
  records from each of the relocatable code object's ``.rela``\ *name*
  sections.

  See :ref:`amdgpu-relocation-records` for the relocation records supported by
  the AMDGPU backend.

``.text``
  The executable machine code for the kernels and the functions they call.
  Generated as position independent code. See :ref:`amdgpu-code-conventions`
  for information on conventions used in the ISA generation.

.. _amdgpu-note-records:

Note Records
------------

The AMDGPU backend code object contains ELF note records in the ``.note``
section. The set of generated notes and their semantics depend on the code
object version; see :ref:`amdgpu-note-records-v2` and
:ref:`amdgpu-note-records-v3-onwards`.

As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero-byte padding
must be generated after the ``name`` field to ensure the ``desc`` field is
4-byte aligned. In addition, minimal zero-byte padding must be generated to
ensure the ``desc`` field size is a multiple of 4 bytes. The ``sh_addralign``
field of the ``.note`` section must be at least 4 to indicate at least 4-byte
alignment.

.. _amdgpu-note-records-v2:

Code Object V2 Note Records
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. warning::

  Code object V2 is not the default code object version emitted by
  this version of LLVM.
The AMDGPU backend code object uses the following ELF note records in the
``.note`` section when compiling for code object V2.

The note record vendor field is "AMD".

Additional note records may be present, but any which are not documented here
are deprecated and should not be used.

  .. table:: AMDGPU Code Object V2 ELF Note Records
     :name: amdgpu-elf-note-records-v2-table

     ===== ===================================== ======================================
     Name  Type                                  Description
     ===== ===================================== ======================================
     "AMD" ``NT_AMD_HSA_CODE_OBJECT_VERSION``    Code object version.
     "AMD" ``NT_AMD_HSA_HSAIL``                  HSAIL properties generated by the
                                                 HSAIL Finalizer and not the LLVM
                                                 compiler.
     "AMD" ``NT_AMD_HSA_ISA_VERSION``            Target ISA version.
     "AMD" ``NT_AMD_HSA_METADATA``               Metadata null terminated string in
                                                 YAML [YAML]_ textual format.
     "AMD" ``NT_AMD_HSA_ISA_NAME``               Target ISA name.
     ===== ===================================== ======================================

..

  .. table:: AMDGPU Code Object V2 ELF Note Record Enumeration Values
     :name: amdgpu-elf-note-record-enumeration-values-v2-table

     ===================================== =====
     Name                                  Value
     ===================================== =====
     ``NT_AMD_HSA_CODE_OBJECT_VERSION``    1
     ``NT_AMD_HSA_HSAIL``                  2
     ``NT_AMD_HSA_ISA_VERSION``            3
     *reserved*                            4-9
     ``NT_AMD_HSA_METADATA``               10
     ``NT_AMD_HSA_ISA_NAME``               11
     ===================================== =====

``NT_AMD_HSA_CODE_OBJECT_VERSION``
  Specifies the code object version number. The description field has the
  following layout:

  .. code:: c

     struct amdgpu_hsa_note_code_object_version_s {
       uint32_t major_version;
       uint32_t minor_version;
     };

  The ``major_version`` has a value less than or equal to 2.
1423 1424``NT_AMD_HSA_HSAIL`` 1425 Specifies the HSAIL properties used by the HSAIL Finalizer. The description 1426 field has the following layout: 1427 1428 .. code:: c 1429 1430 struct amdgpu_hsa_note_hsail_s { 1431 uint32_t hsail_major_version; 1432 uint32_t hsail_minor_version; 1433 uint8_t profile; 1434 uint8_t machine_model; 1435 uint8_t default_float_round; 1436 }; 1437 1438``NT_AMD_HSA_ISA_VERSION`` 1439 Specifies the target ISA version. The description field has the following layout: 1440 1441 .. code:: c 1442 1443 struct amdgpu_hsa_note_isa_s { 1444 uint16_t vendor_name_size; 1445 uint16_t architecture_name_size; 1446 uint32_t major; 1447 uint32_t minor; 1448 uint32_t stepping; 1449 char vendor_and_architecture_name[1]; 1450 }; 1451 1452 ``vendor_name_size`` and ``architecture_name_size`` are the length of the 1453 vendor and architecture names respectively, including the NUL character. 1454 1455 ``vendor_and_architecture_name`` contains the NUL terminates string for the 1456 vendor, immediately followed by the NUL terminated string for the 1457 architecture. 1458 1459 This note record is used by the HSA runtime loader. 1460 1461 Code object V2 only supports a limited number of processors and has fixed 1462 settings for target features. See 1463 :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a list of 1464 processors and the corresponding target ID. In the table the note record ISA 1465 name is a concatenation of the vendor name, architecture name, major, minor, 1466 and stepping separated by a ":". 1467 1468 The target ID column shows the processor name and fixed target features used 1469 by the LLVM compiler. The LLVM compiler does not generate a 1470 ``NT_AMD_HSA_HSAIL`` note record. 1471 1472 A code object generated by the Finalizer also uses code object V2 and always 1473 generates a ``NT_AMD_HSA_HSAIL`` note record. 
The processor name and 1474 ``sramecc`` target feature is as shown in 1475 :ref:`amdgpu-elf-note-record-supported_processors-v2-table` but the ``xnack`` 1476 target feature is specified by the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags`` 1477 bit. 1478 1479``NT_AMD_HSA_ISA_NAME`` 1480 Specifies the target ISA name as a non-NUL terminated string. 1481 1482 This note record is not used by the HSA runtime loader. 1483 1484 See the ``NT_AMD_HSA_ISA_VERSION`` note record description of the code object 1485 V2's limited support of processors and fixed settings for target features. 1486 1487 See :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a mapping 1488 from the string to the corresponding target ID. If the ``xnack`` target 1489 feature is supported and enabled, the string produced by the LLVM compiler 1490 will may have a ``+xnack`` appended. The Finlizer did not do the appending and 1491 instead used the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags`` bit. 1492 1493``NT_AMD_HSA_METADATA`` 1494 Specifies extensible metadata associated with the code objects executed on HSA 1495 [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). It is required when the 1496 target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See 1497 :ref:`amdgpu-amdhsa-code-object-metadata-v2` for the syntax of the code object 1498 metadata string. 1499 1500 .. 
table:: AMDGPU Code Object V2 Supported Processors and Fixed Target Feature Settings 1501 :name: amdgpu-elf-note-record-supported_processors-v2-table 1502 1503 ===================== ========================== 1504 Note Record ISA Name Target ID 1505 ===================== ========================== 1506 ``AMD:AMDGPU:6:0:0`` ``gfx600`` 1507 ``AMD:AMDGPU:6:0:1`` ``gfx601`` 1508 ``AMD:AMDGPU:6:0:2`` ``gfx602`` 1509 ``AMD:AMDGPU:7:0:0`` ``gfx700`` 1510 ``AMD:AMDGPU:7:0:1`` ``gfx701`` 1511 ``AMD:AMDGPU:7:0:2`` ``gfx702`` 1512 ``AMD:AMDGPU:7:0:3`` ``gfx703`` 1513 ``AMD:AMDGPU:7:0:4`` ``gfx704`` 1514 ``AMD:AMDGPU:7:0:5`` ``gfx705`` 1515 ``AMD:AMDGPU:8:0:0`` ``gfx802`` 1516 ``AMD:AMDGPU:8:0:1`` ``gfx801:xnack+`` 1517 ``AMD:AMDGPU:8:0:2`` ``gfx802`` 1518 ``AMD:AMDGPU:8:0:3`` ``gfx803`` 1519 ``AMD:AMDGPU:8:0:4`` ``gfx803`` 1520 ``AMD:AMDGPU:8:0:5`` ``gfx805`` 1521 ``AMD:AMDGPU:8:1:0`` ``gfx810:xnack+`` 1522 ``AMD:AMDGPU:9:0:0`` ``gfx900:xnack-`` 1523 ``AMD:AMDGPU:9:0:1`` ``gfx900:xnack+`` 1524 ``AMD:AMDGPU:9:0:2`` ``gfx902:xnack-`` 1525 ``AMD:AMDGPU:9:0:3`` ``gfx902:xnack+`` 1526 ``AMD:AMDGPU:9:0:4`` ``gfx904:xnack-`` 1527 ``AMD:AMDGPU:9:0:5`` ``gfx904:xnack+`` 1528 ``AMD:AMDGPU:9:0:6`` ``gfx906:sramecc-:xnack-`` 1529 ``AMD:AMDGPU:9:0:7`` ``gfx906:sramecc-:xnack+`` 1530 ``AMD:AMDGPU:9:0:12`` ``gfx90c:xnack-`` 1531 ===================== ========================== 1532 1533.. _amdgpu-note-records-v3-onwards: 1534 1535Code Object V3 and Above Note Records 1536~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1537 1538The AMDGPU backend code object uses the following ELF note record in the 1539``.note`` section when compiling for code object V3 and above. 1540 1541The note record vendor field is "AMDGPU". 1542 1543Additional note records may be present, but any which are not documented here 1544are deprecated and should not be used. 1545 1546 .. 
table:: AMDGPU Code Object V3 and Above ELF Note Records 1547 :name: amdgpu-elf-note-records-table-v3-onwards 1548 1549 ======== ============================== ====================================== 1550 Name Type Description 1551 ======== ============================== ====================================== 1552 "AMDGPU" ``NT_AMDGPU_METADATA`` Metadata in Message Pack [MsgPack]_ 1553 binary format. 1554 ======== ============================== ====================================== 1555 1556.. 1557 1558 .. table:: AMDGPU Code Object V3 and Above ELF Note Record Enumeration Values 1559 :name: amdgpu-elf-note-record-enumeration-values-table-v3-onwards 1560 1561 ============================== ===== 1562 Name Value 1563 ============================== ===== 1564 *reserved* 0-31 1565 ``NT_AMDGPU_METADATA`` 32 1566 ============================== ===== 1567 1568``NT_AMDGPU_METADATA`` 1569 Specifies extensible metadata associated with an AMDGPU code object. It is 1570 encoded as a map in the Message Pack [MsgPack]_ binary data format. See 1571 :ref:`amdgpu-amdhsa-code-object-metadata-v3`, 1572 :ref:`amdgpu-amdhsa-code-object-metadata-v4` and 1573 :ref:`amdgpu-amdhsa-code-object-metadata-v5` for the map keys defined for the 1574 ``amdhsa`` OS. 1575 1576.. _amdgpu-symbols: 1577 1578Symbols 1579------- 1580 1581Symbols include the following: 1582 1583 .. 
table:: AMDGPU ELF Symbols 1584 :name: amdgpu-elf-symbols-table 1585 1586 ===================== ================== ================ ================== 1587 Name Type Section Description 1588 ===================== ================== ================ ================== 1589 *link-name* ``STT_OBJECT`` - ``.data`` Global variable 1590 - ``.rodata`` 1591 - ``.bss`` 1592 *link-name*\ ``.kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor 1593 *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point 1594 *link-name* ``STT_OBJECT`` - SHN_AMDGPU_LDS Global variable in LDS 1595 ===================== ================== ================ ================== 1596 1597Global variable 1598 Global variables both used and defined by the compilation unit. 1599 1600 If the symbol is defined in the compilation unit then it is allocated in the 1601 appropriate section according to if it has initialized data or is readonly. 1602 1603 If the symbol is external then its section is ``STN_UNDEF`` and the loader 1604 will resolve relocations using the definition provided by another code object 1605 or explicitly defined by the runtime. 1606 1607 If the symbol resides in local/group memory (LDS) then its section is the 1608 special processor specific section name ``SHN_AMDGPU_LDS``, and the 1609 ``st_value`` field describes alignment requirements as it does for common 1610 symbols. 1611 1612 .. TODO:: 1613 1614 Add description of linked shared object symbols. Seems undefined symbols 1615 are marked as STT_NOTYPE. 1616 1617Kernel descriptor 1618 Every HSA kernel has an associated kernel descriptor. It is the address of the 1619 kernel descriptor that is used in the AQL dispatch packet used to invoke the 1620 kernel, not the kernel entry point. The layout of the HSA kernel descriptor is 1621 defined in :ref:`amdgpu-amdhsa-kernel-descriptor`. 1622 1623Kernel entry point 1624 Every HSA kernel also has a symbol for its machine code entry point. 1625 1626.. 
.. _amdgpu-relocation-records:

Relocation Records
------------------

The AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
relocatable fields are:

``word16``
  This specifies a 16-bit field occupying 2 bytes with arbitrary byte
  alignment. These values use the same byte order as other word values in the
  AMDGPU architecture.

``word32``
  This specifies a 32-bit field occupying 4 bytes with arbitrary byte
  alignment. These values use the same byte order as other word values in the
  AMDGPU architecture.

``word64``
  This specifies a 64-bit field occupying 8 bytes with arbitrary byte
  alignment. These values use the same byte order as other word values in the
  AMDGPU architecture.

The following notations are used for specifying relocation calculations:

**A**
  Represents the addend used to compute the value of the relocatable field.

**G**
  Represents the offset into the global offset table at which the relocation
  entry's symbol will reside during execution.

**GOT**
  Represents the address of the global offset table.

**P**
  Represents the place (section offset for ``et_rel`` or address for
  ``et_dyn``) of the storage unit being relocated (computed using
  ``r_offset``).

**S**
  Represents the value of the symbol whose index resides in the relocation
  entry. Relocations not using this must specify a symbol index of
  ``STN_UNDEF``.

**B**
  Represents the base address of a loaded executable or shared object which is
  the difference between the ELF address and the actual load address.
  Relocations using this are only valid in executable or shared objects.

The following relocation types are supported:
  .. table:: AMDGPU ELF Relocation Records
     :name: amdgpu-elf-relocation-records-table

     ========================== ======= ===== ========== ==============================
     Relocation Type            Kind    Value Field      Calculation
     ========================== ======= ===== ========== ==============================
     ``R_AMDGPU_NONE``                  0     *none*     *none*
     ``R_AMDGPU_ABS32_LO``      Static, 1     ``word32`` (S + A) & 0xFFFFFFFF
                                Dynamic
     ``R_AMDGPU_ABS32_HI``      Static, 2     ``word32`` (S + A) >> 32
                                Dynamic
     ``R_AMDGPU_ABS64``         Static, 3     ``word64`` S + A
                                Dynamic
     ``R_AMDGPU_REL32``         Static  4     ``word32`` S + A - P
     ``R_AMDGPU_REL64``         Static  5     ``word64`` S + A - P
     ``R_AMDGPU_ABS32``         Static, 6     ``word32`` S + A
                                Dynamic
     ``R_AMDGPU_GOTPCREL``      Static  7     ``word32`` G + GOT + A - P
     ``R_AMDGPU_GOTPCREL32_LO`` Static  8     ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
     ``R_AMDGPU_GOTPCREL32_HI`` Static  9     ``word32`` (G + GOT + A - P) >> 32
     ``R_AMDGPU_REL32_LO``      Static  10    ``word32`` (S + A - P) & 0xFFFFFFFF
     ``R_AMDGPU_REL32_HI``      Static  11    ``word32`` (S + A - P) >> 32
     *reserved*                         12
     ``R_AMDGPU_RELATIVE64``    Dynamic 13    ``word64`` B + A
     ``R_AMDGPU_REL16``         Static  14    ``word16`` ((S + A - P) - 4) / 4
     ========================== ======= ===== ========== ==============================

``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by
the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``.

There is no current OS loader support for 32-bit programs and so
``R_AMDGPU_ABS32`` is not used.

.. _amdgpu-loaded-code-object-path-uniform-resource-identifier:

Loaded Code Object Path Uniform Resource Identifier (URI)
---------------------------------------------------------

The AMD GPU code object loader represents the path of the ELF shared object
from which the code object was loaded as a textual Uniform Resource Identifier
(URI).
Note that the code object is the in-memory loaded relocated form of the ELF
shared object. Multiple code objects may be loaded at different memory
addresses in the same process from the same ELF shared object.

The loaded code object path URI syntax is defined by the following BNF syntax:

.. code::

  code_object_uri ::== file_uri | memory_uri
  file_uri        ::== "file://" file_path [ range_specifier ]
  memory_uri      ::== "memory://" process_id range_specifier
  range_specifier ::== [ "#" | "?" ] "offset=" number "&" "size=" number
  file_path       ::== URI_ENCODED_OS_FILE_PATH
  process_id      ::== DECIMAL_NUMBER
  number          ::== HEX_NUMBER | DECIMAL_NUMBER | OCTAL_NUMBER

**number**
  Is a C integral literal where hexadecimal values are prefixed by "0x" or
  "0X", and octal values by "0".

**file_path**
  Is the file's path specified as a URI encoded UTF-8 string. In URI encoding,
  every character that is not in the regular expression ``[a-zA-Z0-9/_.~-]``
  is encoded as two uppercase hexadecimal digits preceded by "%". Directories
  in the path are separated by "/".

**offset**
  Is a 0-based byte offset to the start of the code object. For a file URI, it
  is from the start of the file specified by the ``file_path``, and if omitted
  defaults to 0. For a memory URI, it is the memory address and is required.

**size**
  Is the number of bytes in the code object. For a file URI, if omitted it
  defaults to the size of the file. It is required for a memory URI.

**process_id**
  Is the identity of the process owning the memory. For Linux it is the C
  unsigned integral decimal literal for the process ID (PID).

For example:

.. code::

  file:///dir1/dir2/file1
  file:///dir3/dir4/file2#offset=0x2000&size=3000
  memory://1234#offset=0x20000&size=3000
.. _amdgpu-dwarf-debug-information:

DWARF Debug Information
=======================

.. warning::

   This section describes **provisional support** for AMDGPU DWARF [DWARF]_
   that is not currently fully implemented and is subject to change.

AMDGPU generates DWARF [DWARF]_ debugging information ELF sections (see
:ref:`amdgpu-elf-code-object`) which contain information that maps the code
object executable code and data to the source language constructs. It can be
used by tools such as debuggers and profilers. It uses features defined in
:doc:`AMDGPUDwarfExtensionsForHeterogeneousDebugging` that are made available
in DWARF Version 4 and DWARF Version 5 as an LLVM vendor extension.

This section defines the AMDGPU target architecture specific DWARF mappings.

.. _amdgpu-dwarf-register-identifier:

Register Identifier
-------------------

This section defines the AMDGPU target architecture register numbers used in
DWARF operation expressions (see DWARF Version 5 section 2.5 and
:ref:`amdgpu-dwarf-operation-expressions`) and Call Frame Information
instructions (see DWARF Version 5 section 6.4 and
:ref:`amdgpu-dwarf-call-frame-information`).

A single code object can contain code for kernels that have different
wavefront sizes. The vector registers and some scalar registers are based on
the wavefront size. AMDGPU defines distinct DWARF registers for each wavefront
size. This simplifies the consumer of the DWARF so that each register has a
fixed size, rather than being dynamic according to the wavefront size mode.
Similarly, distinct DWARF registers are defined for those registers that vary
in size according to the process address size. This allows a consumer to treat
a specific AMDGPU processor as a single architecture regardless of how it is
configured at run time.
The compiler explicitly specifies the DWARF registers that match the mode in
which the code it is generating will be executed.

DWARF registers are encoded as numbers, which are mapped to architecture
registers. The mapping for AMDGPU is defined in
:ref:`amdgpu-dwarf-register-mapping-table`. All AMDGPU targets use the same
mapping.

.. table:: AMDGPU DWARF Register Mapping
   :name: amdgpu-dwarf-register-mapping-table

   ============== ================= ======== ==================================
   DWARF Register AMDGPU Register   Bit Size Description
   ============== ================= ======== ==================================
   0              PC_32             32       Program Counter (PC) when
                                             executing in a 32-bit process
                                             address space. Used in the CFI to
                                             describe the PC of the calling
                                             frame.
   1              EXEC_MASK_32      32       Execution Mask Register when
                                             executing in wavefront 32 mode.
   2-15           *Reserved*                 *Reserved for highly accessed
                                             registers using DWARF shortcut.*
   16             PC_64             64       Program Counter (PC) when
                                             executing in a 64-bit process
                                             address space. Used in the CFI to
                                             describe the PC of the calling
                                             frame.
   17             EXEC_MASK_64      64       Execution Mask Register when
                                             executing in wavefront 64 mode.
   18-31          *Reserved*                 *Reserved for highly accessed
                                             registers using DWARF shortcut.*
   32-95          SGPR0-SGPR63      32       Scalar General Purpose
                                             Registers.
   96-127         *Reserved*                 *Reserved for frequently accessed
                                             registers using DWARF 1-byte
                                             ULEB.*
   128            STATUS            32       Status Register.
   129-511        *Reserved*                 *Reserved for future Scalar
                                             Architectural Registers.*
   512            VCC_32            32       Vector Condition Code Register
                                             when executing in wavefront 32
                                             mode.
   513-767        *Reserved*                 *Reserved for future Vector
                                             Architectural Registers when
                                             executing in wavefront 32 mode.*
   768            VCC_64            64       Vector Condition Code Register
                                             when executing in wavefront 64
                                             mode.
   769-1023       *Reserved*                 *Reserved for future Vector
                                             Architectural Registers when
                                             executing in wavefront 64 mode.*
   1024-1087      *Reserved*                 *Reserved for padding.*
   1088-1129      SGPR64-SGPR105    32       Scalar General Purpose Registers.
   1130-1535      *Reserved*                 *Reserved for future Scalar
                                             General Purpose Registers.*
   1536-1791      VGPR0-VGPR255     32*32    Vector General Purpose Registers
                                             when executing in wavefront 32
                                             mode.
   1792-2047      *Reserved*                 *Reserved for future Vector
                                             General Purpose Registers when
                                             executing in wavefront 32 mode.*
   2048-2303      AGPR0-AGPR255     32*32    Vector Accumulation Registers
                                             when executing in wavefront 32
                                             mode.
   2304-2559      *Reserved*                 *Reserved for future Vector
                                             Accumulation Registers when
                                             executing in wavefront 32 mode.*
   2560-2815      VGPR0-VGPR255     64*32    Vector General Purpose Registers
                                             when executing in wavefront 64
                                             mode.
   2816-3071      *Reserved*                 *Reserved for future Vector
                                             General Purpose Registers when
                                             executing in wavefront 64 mode.*
   3072-3327      AGPR0-AGPR255     64*32    Vector Accumulation Registers
                                             when executing in wavefront 64
                                             mode.
   3328-3583      *Reserved*                 *Reserved for future Vector
                                             Accumulation Registers when
                                             executing in wavefront 64 mode.*
   ============== ================= ======== ==================================

The vector registers are represented as the full size for the wavefront. They
are organized as consecutive dwords (32-bits), one per lane, with the dword at
the least significant bit position corresponding to lane 0 and so forth. DWARF
location expressions involving the ``DW_OP_LLVM_offset`` and
``DW_OP_LLVM_push_lane`` operations are used to select the part of the vector
register corresponding to the lane that is executing the current thread of
execution in languages that are implemented using a SIMD or SIMT execution
model.
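The per-lane dword layout described above can be sketched as follows. This is
a minimal illustration only, not part of any AMDGPU tooling; the register
contents and wavefront size are hypothetical:

.. code:: python

  def lane_dword(vector_register: bytes, lane: int) -> int:
      """Extract the 32-bit value belonging to `lane` from a vector
      register represented as its full wavefront-sized byte blob.

      Lane 0 occupies the least significant dword, lane 1 the next,
      and so on; bytes within a dword are little-endian."""
      offset = lane * 4  # one dword (4 bytes) per lane
      return int.from_bytes(vector_register[offset:offset + 4], "little")

  # A hypothetical wavefront 32 VGPR where each lane i holds the value i.
  wave32_vgpr = b"".join(i.to_bytes(4, "little") for i in range(32))

  print(lane_dword(wave32_vgpr, 0))   # 0
  print(lane_dword(wave32_vgpr, 5))   # 5

A debugger evaluating ``DW_OP_LLVM_push_lane`` followed by an offset
computation is conceptually performing the ``lane * 4`` indexing shown here.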
If the wavefront size is 32 lanes then the wavefront 32 mode register
definitions are used. If the wavefront size is 64 lanes then the wavefront 64
mode register definitions are used. Some AMDGPU targets support executing in
both wavefront 32 and wavefront 64 mode. The register definitions
corresponding to the wavefront mode of the generated code will be used.

If code is generated to execute in a 32-bit process address space, then the
32-bit process address space register definitions are used. If code is
generated to execute in a 64-bit process address space, then the 64-bit
process address space register definitions are used. The ``amdgcn`` target
only supports the 64-bit process address space.

.. _amdgpu-dwarf-address-class-identifier:

Address Class Identifier
------------------------

The DWARF address class represents the source language memory space. See
DWARF Version 5 section 2.12 which is updated by the *DWARF Extensions For
Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses`.

The DWARF address class mapping used for AMDGPU is defined in
:ref:`amdgpu-dwarf-address-class-mapping-table`.
.. table:: AMDGPU DWARF Address Class Mapping
   :name: amdgpu-dwarf-address-class-mapping-table

   ========================= ====== =================
   DWARF                            AMDGPU
   -------------------------------- -----------------
   Address Class Name        Value  Address Space
   ========================= ====== =================
   ``DW_ADDR_none``          0x0000 Generic (Flat)
   ``DW_ADDR_LLVM_global``   0x0001 Global
   ``DW_ADDR_LLVM_constant`` 0x0002 Global
   ``DW_ADDR_LLVM_group``    0x0003 Local (group/LDS)
   ``DW_ADDR_LLVM_private``  0x0004 Private (Scratch)
   ``DW_ADDR_AMDGPU_region`` 0x8000 Region (GDS)
   ========================= ====== =================

The DWARF address class values defined in the *DWARF Extensions For
Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses` are
used.

In addition, ``DW_ADDR_AMDGPU_region`` is encoded as a vendor extension. This
is available for use for the AMD extension for access to the hardware GDS
memory which is scratchpad memory allocated per device.

For AMDGPU, if no ``DW_AT_address_class`` attribute is present, then the
default address class of ``DW_ADDR_none`` is used.

See :ref:`amdgpu-dwarf-address-space-identifier` for information on the AMDGPU
mapping of DWARF address classes to DWARF address spaces, including address
size and NULL value.

.. _amdgpu-dwarf-address-space-identifier:

Address Space Identifier
------------------------

DWARF address spaces correspond to target architecture specific linear
addressable memory areas. See DWARF Version 5 section 2.12 and *DWARF
Extensions For Heterogeneous Debugging* section
:ref:`amdgpu-dwarf-segment_addresses`.

The DWARF address space mapping used for AMDGPU is defined in
:ref:`amdgpu-dwarf-address-space-mapping-table`.
.. table:: AMDGPU DWARF Address Space Mapping
   :name: amdgpu-dwarf-address-space-mapping-table

   ======================================= ===== ======= ======== ================= =======================
   DWARF                                         AMDGPU                             Notes
   --------------------------------------- ----- ---------------- ----------------- -----------------------
   Address Space Name                      Value Address Bit Size Address Space
   --------------------------------------- ----- ------- -------- ----------------- -----------------------
   ..                                            64-bit  32-bit
                                                 process process
                                                 address address
                                                 space   space
   ======================================= ===== ======= ======== ================= =======================
   ``DW_ASPACE_none``                      0x00  64      32       Global            *default address space*
   ``DW_ASPACE_AMDGPU_generic``            0x01  64      32       Generic (Flat)
   ``DW_ASPACE_AMDGPU_region``             0x02  32      32       Region (GDS)
   ``DW_ASPACE_AMDGPU_local``              0x03  32      32       Local (group/LDS)
   *Reserved*                              0x04
   ``DW_ASPACE_AMDGPU_private_lane``       0x05  32      32       Private (Scratch) *focused lane*
   ``DW_ASPACE_AMDGPU_private_wave``       0x06  32      32       Private (Scratch) *unswizzled wavefront*
   ======================================= ===== ======= ======== ================= =======================

See :ref:`amdgpu-address-spaces` for information on the AMDGPU address spaces
including address size and NULL value.

The ``DW_ASPACE_none`` address space is the default target architecture
address space used in DWARF operations that do not specify an address space.
It therefore has to map to the global address space so that the
``DW_OP_addr*`` and related operations can refer to addresses in the program
code.

The ``DW_ASPACE_AMDGPU_generic`` address space allows location expressions to
specify the flat address space. If the address corresponds to an address in
the local address space, then it corresponds to the wavefront that is
executing the focused thread of execution.
If the address corresponds to an address in the private address space, then it
corresponds to the lane that is executing the focused thread of execution for
languages that are implemented using a SIMD or SIMT execution model.

.. note::

   CUDA-like languages such as HIP that do not have address spaces in the
   language type system, but do allow variables to be allocated in different
   address spaces, need to explicitly specify the ``DW_ASPACE_AMDGPU_generic``
   address space in the DWARF expression operations as the default address
   space is the global address space.

The ``DW_ASPACE_AMDGPU_local`` address space allows location expressions to
specify the local address space corresponding to the wavefront that is
executing the focused thread of execution.

The ``DW_ASPACE_AMDGPU_private_lane`` address space allows location
expressions to specify the private address space corresponding to the lane
that is executing the focused thread of execution for languages that are
implemented using a SIMD or SIMT execution model.

The ``DW_ASPACE_AMDGPU_private_wave`` address space allows location
expressions to specify the unswizzled private address space corresponding to
the wavefront that is executing the focused thread of execution. The wavefront
view of private memory is the per-wavefront unswizzled backing memory layout
defined in :ref:`amdgpu-address-spaces`, such that address 0 corresponds to
the first location for the backing memory of the wavefront (namely the address
is not offset by ``wavefront-scratch-base``).
The following formula can be used to convert from a
``DW_ASPACE_AMDGPU_private_lane`` address to a
``DW_ASPACE_AMDGPU_private_wave`` address:

::

  private-address-wavefront =
    ((private-address-lane / 4) * wavefront-size * 4) +
    (wavefront-lane-id * 4) + (private-address-lane % 4)

If the ``DW_ASPACE_AMDGPU_private_lane`` address is dword aligned, and the
start of the dwords for each lane starting with lane 0 is required, then this
simplifies to:

::

  private-address-wavefront =
    private-address-lane * wavefront-size

A compiler can use the ``DW_ASPACE_AMDGPU_private_wave`` address space to read
a complete spilled vector register back into a complete vector register in the
CFI. The frame pointer can be a private lane address which is dword aligned,
which can be shifted to multiply by the wavefront size, and then used to form
a private wavefront address that gives a location for a contiguous set of
dwords, one per lane, where the vector register dwords are spilled. The
compiler knows the wavefront size since it generates the code. Note that the
type of the address may have to be converted as the size of a
``DW_ASPACE_AMDGPU_private_lane`` address may be smaller than the size of a
``DW_ASPACE_AMDGPU_private_wave`` address.

.. _amdgpu-dwarf-lane-identifier:

Lane identifier
---------------

DWARF lane identifiers specify a target architecture lane position for
hardware that executes in a SIMD or SIMT manner, and on which a source
language maps its threads of execution onto those lanes. The DWARF lane
identifier is pushed by the ``DW_OP_LLVM_push_lane`` DWARF expression
operation. See DWARF Version 5 section 2.5 which is updated by *DWARF
Extensions For Heterogeneous Debugging* section
:ref:`amdgpu-dwarf-operation-expressions`.
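The private-lane to private-wave address conversion given earlier, which uses
the lane identifier described in this section, can be sketched as follows.
This is an illustration only; the addresses and wavefront size are
hypothetical:

.. code:: python

  def private_lane_to_wave(lane_addr: int, lane_id: int,
                           wavefront_size: int) -> int:
      """Convert a DW_ASPACE_AMDGPU_private_lane address to the
      corresponding DW_ASPACE_AMDGPU_private_wave address.

      Scratch memory is swizzled in dword (4-byte) units: each dword of
      a lane's private memory is interleaved with the same dword of
      every other lane in the wavefront."""
      return ((lane_addr // 4) * wavefront_size * 4
              + lane_id * 4
              + lane_addr % 4)

  # For a dword-aligned lane address, lane 0 reduces to the simplified
  # formula lane_addr * wavefront_size.
  assert private_lane_to_wave(8, 0, 64) == 8 * 64
  # Lane 5's copy of the same dword sits 5 dwords further on.
  assert private_lane_to_wave(8, 5, 64) == 8 * 64 + 20
  print("ok")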
For AMDGPU, the lane identifier corresponds to the hardware lane ID of a
wavefront. It is numbered from 0 to the wavefront size minus 1.

Operation Expressions
---------------------

DWARF expressions are used to compute program values and the locations of
program objects. See DWARF Version 5 section 2.5 and
:ref:`amdgpu-dwarf-operation-expressions`.

DWARF location descriptions describe how to access storage which includes
memory and registers. When accessing storage on AMDGPU, bytes are ordered with
least significant bytes first, and bits are ordered within bytes with least
significant bits first.

For AMDGPU CFI expressions, ``DW_OP_LLVM_select_bit_piece`` is used to
describe unwinding vector registers that are spilled under the execution mask
to memory: the zero-single location description is the vector register, and
the one-single location description is the spilled memory location
description. The ``DW_OP_LLVM_form_aspace_address`` operation is used to
specify the address space of the memory location description.

In AMDGPU expressions, ``DW_OP_LLVM_select_bit_piece`` is used by the
``DW_AT_LLVM_lane_pc`` attribute expression where divergent control flow is
controlled by the execution mask. An undefined location description together
with ``DW_OP_LLVM_extend`` is used to indicate the lane was not active on
entry to the subprogram. See :ref:`amdgpu-dwarf-dw-at-llvm-lane-pc` for an
example.

Debugger Information Entry Attributes
-------------------------------------

This section describes how certain debugger information entry attributes are
used by AMDGPU. See the sections in DWARF Version 5 section 3.3.5 and 3.1.1
which are updated by *DWARF Extensions For Heterogeneous Debugging* section
:ref:`amdgpu-dwarf-low-level-information` and
:ref:`amdgpu-dwarf-full-and-partial-compilation-unit-entries`.
.. _amdgpu-dwarf-dw-at-llvm-lane-pc:

``DW_AT_LLVM_lane_pc``
~~~~~~~~~~~~~~~~~~~~~~

For AMDGPU, the ``DW_AT_LLVM_lane_pc`` attribute is used to specify the
program location of the separate lanes of a SIMT thread.

If the lane is an active lane then this will be the same as the current
program location.

If the lane is inactive, but was active on entry to the subprogram, then this
is the program location in the subprogram at which execution of the lane is
conceptually positioned.

If the lane was not active on entry to the subprogram, then this will be the
undefined location. A client debugger can check if the lane is part of a valid
work-group by checking that the lane is in the range of the associated
work-group within the grid, accounting for partial work-groups. If it is not,
then the debugger can omit any information for the lane. Otherwise, the
debugger may repeatedly unwind the stack and inspect the
``DW_AT_LLVM_lane_pc`` of the calling subprogram until it finds a
non-undefined location. Conceptually the lane only has the call frames for
which it has a non-undefined ``DW_AT_LLVM_lane_pc``.

The following example illustrates how the AMDGPU backend can generate a DWARF
location list expression for the nested ``IF/THEN/ELSE`` structures of the
following subprogram pseudo code for a target with 64 lanes per wavefront.

.. code::
  :number-lines:

  SUBPROGRAM X
  BEGIN
    a;
    IF (c1) THEN
      b;
      IF (c2) THEN
        c;
      ELSE
        d;
      ENDIF
      e;
    ELSE
      f;
    ENDIF
    g;
  END

The AMDGPU backend may generate the following pseudo LLVM MIR to manipulate
the execution mask (``EXEC``) to linearize the control flow. The condition is
evaluated to make a mask of the lanes for which the condition evaluates to
true.
First the ``THEN`` region is executed by setting the ``EXEC`` mask to the
logical ``AND`` of the current ``EXEC`` mask with the condition mask. Then the
``ELSE`` region is executed by negating the ``EXEC`` mask and logical ``AND``
of the saved ``EXEC`` mask at the start of the region. After the
``IF/THEN/ELSE`` region the ``EXEC`` mask is restored to the value it had at
the beginning of the region. This is shown below. Other approaches are
possible, but the basic concept is the same.

.. code::
  :number-lines:

  $lex_start:
    a;
    %1 = EXEC
    %2 = c1
  $lex_1_start:
    EXEC = %1 & %2
  $if_1_then:
    b;
    %3 = EXEC
    %4 = c2
  $lex_1_1_start:
    EXEC = %3 & %4
  $lex_1_1_then:
    c;
    EXEC = ~EXEC & %3
  $lex_1_1_else:
    d;
    EXEC = %3
  $lex_1_1_end:
    e;
    EXEC = ~EXEC & %1
  $lex_1_else:
    f;
    EXEC = %1
  $lex_1_end:
    g;
  $lex_end:

To create the DWARF location list expression that defines the location
description of a vector of lane program locations, the LLVM MIR ``DBG_VALUE``
pseudo instruction can be used to annotate the linearized control flow. This
can be done by defining an artificial variable for the lane PC. The DWARF
location list expression created for it is used as the value of the
``DW_AT_LLVM_lane_pc`` attribute on the subprogram's debugger information
entry.

A DWARF procedure is defined for each well nested structured control flow
region which provides the conceptual lane program location for a lane if it is
not active (namely it is divergent). The DWARF operation expression for each
region conceptually inherits the value of the immediately enclosing region and
modifies it according to the semantics of the region.

For an ``IF/THEN/ELSE`` region the divergent program location is at the start
of the region for the ``THEN`` region since it is executed first.
For the ``ELSE`` region the divergent program location is at the end of the
``IF/THEN/ELSE`` region since the ``THEN`` region has completed.

The lane PC artificial variable is assigned at each region transition. It uses
the immediately enclosing region's DWARF procedure to compute the program
location for each lane assuming they are divergent, and then modifies the
result by inserting the current program location for each lane that the
``EXEC`` mask indicates is active.

By having separate DWARF procedures for each region, they can be reused to
define the value for any nested region. This reduces the total size of the
DWARF operation expressions.

The following provides an example using pseudo LLVM MIR.

.. code::
  :number-lines:

  $lex_start:
    DEFINE_DWARF %__uint_64 = DW_TAG_base_type[
      DW_AT_name = "__uint64";
      DW_AT_byte_size = 8;
      DW_AT_encoding = DW_ATE_unsigned;
    ];
    DEFINE_DWARF %__active_lane_pc = DW_TAG_dwarf_procedure[
      DW_AT_name = "__active_lane_pc";
      DW_AT_location = [
        DW_OP_regx PC;
        DW_OP_LLVM_extend 64, 64;
        DW_OP_regval_type EXEC, %__uint_64;
        DW_OP_LLVM_select_bit_piece 64, 64;
      ];
    ];
    DEFINE_DWARF %__divergent_lane_pc = DW_TAG_dwarf_procedure[
      DW_AT_name = "__divergent_lane_pc";
      DW_AT_location = [
        DW_OP_LLVM_undefined;
        DW_OP_LLVM_extend 64, 64;
      ];
    ];
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc;
      DW_OP_call_ref %__active_lane_pc;
    ];
    a;
    %1 = EXEC;
    DBG_VALUE %1, $noreg, %__lex_1_save_exec;
    %2 = c1;
  $lex_1_start:
    EXEC = %1 & %2;
  $lex_1_then:
    DEFINE_DWARF %__divergent_lane_pc_1_then = DW_TAG_dwarf_procedure[
      DW_AT_name = "__divergent_lane_pc_1_then";
      DW_AT_location = DIExpression[
        DW_OP_call_ref %__divergent_lane_pc;
        DW_OP_addrx &lex_1_start;
        DW_OP_stack_value;
        DW_OP_LLVM_extend 64, 64;
        DW_OP_call_ref %__lex_1_save_exec;
        DW_OP_deref_type 64, %__uint_64;
        DW_OP_LLVM_select_bit_piece 64, 64;
      ];
    ];
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc_1_then;
      DW_OP_call_ref %__active_lane_pc;
    ];
    b;
    %3 = EXEC;
    DBG_VALUE %3, $noreg, %__lex_1_1_save_exec;
    %4 = c2;
  $lex_1_1_start:
    EXEC = %3 & %4;
  $lex_1_1_then:
    DEFINE_DWARF %__divergent_lane_pc_1_1_then = DW_TAG_dwarf_procedure[
      DW_AT_name = "__divergent_lane_pc_1_1_then";
      DW_AT_location = DIExpression[
        DW_OP_call_ref %__divergent_lane_pc_1_then;
        DW_OP_addrx &lex_1_1_start;
        DW_OP_stack_value;
        DW_OP_LLVM_extend 64, 64;
        DW_OP_call_ref %__lex_1_1_save_exec;
        DW_OP_deref_type 64, %__uint_64;
        DW_OP_LLVM_select_bit_piece 64, 64;
      ];
    ];
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc_1_1_then;
      DW_OP_call_ref %__active_lane_pc;
    ];
    c;
    EXEC = ~EXEC & %3;
  $lex_1_1_else:
    DEFINE_DWARF %__divergent_lane_pc_1_1_else = DW_TAG_dwarf_procedure[
      DW_AT_name = "__divergent_lane_pc_1_1_else";
      DW_AT_location = DIExpression[
        DW_OP_call_ref %__divergent_lane_pc_1_then;
        DW_OP_addrx &lex_1_1_end;
        DW_OP_stack_value;
        DW_OP_LLVM_extend 64, 64;
        DW_OP_call_ref %__lex_1_1_save_exec;
        DW_OP_deref_type 64, %__uint_64;
        DW_OP_LLVM_select_bit_piece 64, 64;
      ];
    ];
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc_1_1_else;
      DW_OP_call_ref %__active_lane_pc;
    ];
    d;
    EXEC = %3;
  $lex_1_1_end:
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc;
      DW_OP_call_ref %__active_lane_pc;
    ];
    e;
    EXEC = ~EXEC & %1;
  $lex_1_else:
    DEFINE_DWARF %__divergent_lane_pc_1_else = DW_TAG_dwarf_procedure[
      DW_AT_name = "__divergent_lane_pc_1_else";
      DW_AT_location = DIExpression[
        DW_OP_call_ref %__divergent_lane_pc;
        DW_OP_addrx &lex_1_end;
        DW_OP_stack_value;
        DW_OP_LLVM_extend 64, 64;
        DW_OP_call_ref %__lex_1_save_exec;
        DW_OP_deref_type 64, %__uint_64;
        DW_OP_LLVM_select_bit_piece 64, 64;
      ];
    ];
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc_1_else;
      DW_OP_call_ref %__active_lane_pc;
    ];
    f;
    EXEC = %1;
  $lex_1_end:
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc;
      DW_OP_call_ref %__active_lane_pc;
    ];
    g;
  $lex_end:

The DWARF procedure ``%__active_lane_pc`` is used to update the lane PC
elements that are active with the current program location.

Artificial variables ``%__lex_1_save_exec`` and ``%__lex_1_1_save_exec`` are
created for the execution masks saved on entry to a region. Using the
``DBG_VALUE`` pseudo instruction, location list entries will be created that
describe where the artificial variables are allocated at any given program
location. The compiler may allocate them to registers or spill them to memory.

The DWARF procedures for each region use the values of the saved execution
mask artificial variables to only update the lanes that are active on entry
to the region. All other lanes retain the value of the enclosing region where
they were last active. If they were not active on entry to the subprogram,
then they will have the undefined location description.

Other structured control flow regions can be handled similarly. For example,
loops would set the divergent program location for the region at the end of
the loop. Any lanes active will be in the loop, and any lanes not active must
have exited the loop.
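The effect of ``DW_OP_LLVM_select_bit_piece`` in the expressions above can be
sketched as follows. This is a simplified model, not an implementation of a
DWARF evaluator; the PCs, mask, and wavefront size are hypothetical:

.. code:: python

  def select_bit_piece(mask: int, zero_parts, one_parts):
      """Model DW_OP_LLVM_select_bit_piece over per-lane elements: for
      each lane, take the element from one_parts if the lane's mask bit
      is set, otherwise from zero_parts."""
      return [one if (mask >> lane) & 1 else zero
              for lane, (zero, one) in enumerate(zip(zero_parts, one_parts))]

  lanes = 8                        # hypothetical small wavefront
  divergent_pc = [0x1000] * lanes  # conceptual PC for inactive lanes
  current_pc = [0x2000] * lanes    # PC of the focused thread
  exec_mask = 0b00001111           # lanes 0-3 active

  lane_pc = select_bit_piece(exec_mask, divergent_pc, current_pc)
  # Lanes 0-3 get the current PC; lanes 4-7 keep the divergent PC.
  print([hex(pc) for pc in lane_pc])

This mirrors how ``%__active_lane_pc`` overlays the current program location
onto the lanes that ``EXEC`` marks as active, leaving the divergent lane PCs
untouched.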
An ``IF/THEN/ELSEIF/ELSEIF/...`` region can be treated as a nest of
``IF/THEN/ELSE`` regions.

The DWARF procedures can use the active lane artificial variable described in
:ref:`amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane` rather than the actual
``EXEC`` mask in order to support whole or quad wavefront mode.

.. _amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane:

``DW_AT_LLVM_active_lane``
~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``DW_AT_LLVM_active_lane`` attribute on a subprogram debugger information
entry is used to specify the lanes that are conceptually active for a SIMT
thread.

The execution mask may be modified to implement whole or quad wavefront mode
operations. For example, all lanes may need to temporarily be made active to
execute a whole wavefront operation. Such regions would save the ``EXEC``
mask, update it to enable the necessary lanes, perform the operations, and
then restore the ``EXEC`` mask from the saved value. While executing the whole
wavefront region, the conceptual execution mask is the saved value, not the
``EXEC`` value.

This is handled by defining an artificial variable for the active lane mask.
The active lane mask artificial variable would be the actual ``EXEC`` mask for
normal regions, and the saved execution mask for regions where the mask is
temporarily updated. The location list expression created for this artificial
variable is used to define the value of the ``DW_AT_LLVM_active_lane``
attribute.

``DW_AT_LLVM_augmentation``
~~~~~~~~~~~~~~~~~~~~~~~~~~~

For AMDGPU, the ``DW_AT_LLVM_augmentation`` attribute of a compilation unit
debugger information entry has the following value for the augmentation
string:

::

  [amdgpu:v0.0]

The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
extensions used in the DWARF of the compilation unit.
The version number 2407conforms to [SEMVER]_. 2408 2409Call Frame Information 2410---------------------- 2411 2412DWARF Call Frame Information (CFI) describes how a consumer can virtually 2413*unwind* call frames in a running process or core dump. See DWARF Version 5 2414section 6.4 and :ref:`amdgpu-dwarf-call-frame-information`. 2415 2416For AMDGPU, the Common Information Entry (CIE) fields have the following values: 2417 24181. ``augmentation`` string contains the following null-terminated UTF-8 string: 2419 2420 :: 2421 2422 [amd:v0.0] 2423 2424 The ``vX.Y`` specifies the major X and minor Y version number of the AMDGPU 2425 extensions used in this CIE or to the FDEs that use it. The version number 2426 conforms to [SEMVER]_. 2427 24282. ``address_size`` for the ``Global`` address space is defined in 2429 :ref:`amdgpu-dwarf-address-space-identifier`. 2430 24313. ``segment_selector_size`` is 0 as AMDGPU does not use a segment selector. 2432 24334. ``code_alignment_factor`` is 4 bytes. 2434 2435 .. TODO:: 2436 2437 Add to :ref:`amdgpu-processor-table` table. 2438 24395. ``data_alignment_factor`` is 4 bytes. 2440 2441 .. TODO:: 2442 2443 Add to :ref:`amdgpu-processor-table` table. 2444 24456. ``return_address_register`` is ``PC_32`` for 32-bit processes and ``PC_64`` 2446 for 64-bit processes defined in :ref:`amdgpu-dwarf-register-identifier`. 2447 24487. ``initial_instructions`` Since a subprogram X with fewer registers can be 2449 called from subprogram Y that has more allocated, X will not change any of 2450 the extra registers as it cannot access them. Therefore, the default rule 2451 for all columns is ``same value``. 2452 2453For AMDGPU the register number follows the numbering defined in 2454:ref:`amdgpu-dwarf-register-identifier`. 2455 2456For AMDGPU the instructions are variable size. A consumer can subtract 1 from 2457the return address to get the address of a byte within the call site 2458instructions. See DWARF Version 5 section 6.4.4. 
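The return-address adjustment mentioned above can be sketched briefly. Because
instructions are variable size, a return address points just past the call
instruction, so an unwinder subtracts 1 before a range lookup so that the
lookup lands inside the call site rather than in the following range. This is
an illustrative sketch with made-up addresses, not an implementation of any
particular unwinder:

```python
def call_site_lookup_pc(return_address):
    """Adjust a return address so a range lookup lands inside the
    call-site instruction rather than just past it."""
    return return_address - 1

def find_range(ranges, pc):
    """Find the half-open [start, end) range covering pc, if any."""
    for start, end in ranges:
        if start <= pc < end:
            return (start, end)
    return None

# A call instruction occupies [0x100, 0x104); its return address is
# 0x104, which without adjustment would match the *next* range.
ranges = [(0x100, 0x104), (0x104, 0x200)]
pc = call_site_lookup_pc(0x104)
assert find_range(ranges, pc) == (0x100, 0x104)
```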
Accelerated Access
------------------

See DWARF Version 5 section 6.1.

Lookup By Name Section Header
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

See DWARF Version 5 section 6.1.1.4.1 and :ref:`amdgpu-dwarf-lookup-by-name`.

For AMDGPU the lookup by name section header table:

``augmentation_string_size`` (uword)

  Set to the length of the ``augmentation_string`` value, which is always a
  multiple of 4.

``augmentation_string`` (sequence of UTF-8 characters)

  Contains the following UTF-8 string null padded to a multiple of 4 bytes:

  ::

    [amdgpu:v0.0]

  The ``vX.Y`` specifies the major X and minor Y version number of the AMDGPU
  extensions used in the DWARF of this index. The version number conforms to
  [SEMVER]_.

  .. note::

    This differs from the DWARF Version 5 definition, which requires the first
    4 characters to be the vendor ID. But this is consistent with the other
    augmentation strings and does allow multiple vendor contributions.
    However, backwards compatibility may be more desirable.

Lookup By Address Section Header
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

See DWARF Version 5 section 6.1.2.

For AMDGPU the lookup by address section header table:

``address_size`` (ubyte)

  Matches the address size for the ``Global`` address space defined in
  :ref:`amdgpu-dwarf-address-space-identifier`.

``segment_selector_size`` (ubyte)

  AMDGPU does not use a segment selector so this is 0. The entries in the
  ``.debug_aranges`` do not have a segment selector.

Line Number Information
-----------------------

See DWARF Version 5 section 6.2 and :ref:`amdgpu-dwarf-line-number-information`.

AMDGPU does not use the ``isa`` state machine register and always sets it to 0.
The instruction set must be obtained from the ELF file header ``e_flags`` field
in the ``EF_AMDGPU_MACH`` bit position (see :ref:`ELF Header
<amdgpu-elf-header>`). See DWARF Version 5 section 6.2.2.

.. TODO::

   Should the ``isa`` state machine register be used to indicate if the code is
   in wavefront32 or wavefront64 mode? Or used to specify the architecture ISA?

For AMDGPU the line number program header fields have the following values (see
DWARF Version 5 section 6.2.4):

``address_size`` (ubyte)
  Matches the address size for the ``Global`` address space defined in
  :ref:`amdgpu-dwarf-address-space-identifier`.

``segment_selector_size`` (ubyte)
  AMDGPU does not use a segment selector so this is 0.

``minimum_instruction_length`` (ubyte)
  For GFX9-GFX11 this is 4.

``maximum_operations_per_instruction`` (ubyte)
  For GFX9-GFX11 this is 1.

Source text for online-compiled programs (for example, those compiled by the
OpenCL language runtime) may be embedded into the DWARF Version 5 line table.
See DWARF Version 5 section 6.2.4.1 which is updated by *DWARF Extensions For
Heterogeneous Debugging* section :ref:`DW_LNCT_LLVM_source
<amdgpu-dwarf-line-number-information-dw-lnct-llvm-source>`.

The Clang option used to control source embedding in AMDGPU is defined in
:ref:`amdgpu-clang-debug-options-table`.

  .. table:: AMDGPU Clang Debug Options
     :name: amdgpu-clang-debug-options-table

     ==================== ==================================================
     Debug Flag           Description
     ==================== ==================================================
     -g[no-]embed-source  Enable/disable embedding source text in DWARF
                          debug sections. Useful for environments where
                          source cannot be written to disk, such as
                          when performing online compilation.
     ==================== ==================================================

For example:

``-gembed-source``
  Enable the embedded source.

``-gno-embed-source``
  Disable the embedded source.

32-Bit and 64-Bit DWARF Formats
-------------------------------

See DWARF Version 5 section 7.4 and
:ref:`amdgpu-dwarf-32-bit-and-64-bit-dwarf-formats`.

For AMDGPU:

* For the ``amdgcn`` target architecture only the 64-bit process address space
  is supported.

* The producer can generate either the 32-bit or the 64-bit DWARF format. LLVM
  generates the 32-bit DWARF format.

Unit Headers
------------

For AMDGPU the following values apply for each of the unit headers described in
DWARF Version 5 sections 7.5.1.1, 7.5.1.2, and 7.5.1.3:

``address_size`` (ubyte)
  Matches the address size for the ``Global`` address space defined in
  :ref:`amdgpu-dwarf-address-space-identifier`.

.. _amdgpu-code-conventions:

Code Conventions
================

This section provides code conventions used for each supported target triple OS
(see :ref:`amdgpu-target-triples`).

AMDHSA
------

This section provides code conventions used when the target triple OS is
``amdhsa`` (see :ref:`amdgpu-target-triples`).

.. _amdgpu-amdhsa-code-object-metadata:

Code Object Metadata
~~~~~~~~~~~~~~~~~~~~

The code object metadata specifies extensible metadata associated with the code
objects executed on HSA [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). The
encoding and semantics of this metadata depend on the code object version; see
:ref:`amdgpu-amdhsa-code-object-metadata-v2`,
:ref:`amdgpu-amdhsa-code-object-metadata-v3`,
:ref:`amdgpu-amdhsa-code-object-metadata-v4` and
:ref:`amdgpu-amdhsa-code-object-metadata-v5`.
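The metadata tables that follow encode ``printf`` call descriptors as
colon-separated strings of the form ``ID:N:S[0]:...:S[N-1]:FormatString``. A
hypothetical decoder sketch (the helper name is illustrative; note that the
format string is everything after the first ``2 + N`` fields, since it may
itself contain colons):

```python
def parse_printf_metadata(entry):
    """Decode one printf metadata string: a unique id, the count N of
    argument-size fields, N sizes in bytes, then the format string."""
    fields = entry.split(":")
    printf_id = int(fields[0])
    nargs = int(fields[1])
    sizes = [int(s) for s in fields[2:2 + nargs]]
    fmt = ":".join(fields[2 + nargs:])  # rejoin colons inside the format
    return printf_id, nargs, sizes, fmt

assert parse_printf_metadata("1:2:4:8:lane %d: value %f\n") == \
    (1, 2, [4, 8], "lane %d: value %f\n")
```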
Code object metadata is specified in a note record (see
:ref:`amdgpu-note-records`) and is required when the target triple OS is
``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
information necessary to support the HSA compatible runtime kernel queries.
For example, the segment sizes needed in a dispatch packet. In addition, a
high-level language runtime may require other information to be included. For
example, the AMD OpenCL runtime records kernel argument information.

.. _amdgpu-amdhsa-code-object-metadata-v2:

Code Object V2 Metadata
+++++++++++++++++++++++

.. warning::
  Code object V2 is not the default code object version emitted by this version
  of LLVM.

Code object V2 metadata is specified by the ``NT_AMD_HSA_METADATA`` note record
(see :ref:`amdgpu-note-records-v2`).

The metadata is specified as a YAML formatted string (see [YAML]_ and
:doc:`YamlIO`).

.. TODO::

   Is the string null terminated? It probably should not be if YAML allows it
   to contain null characters, otherwise it should be.

The metadata is represented as a single YAML document comprised of the mapping
defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-v2-table` and
referenced tables.

For boolean values, the string values of ``false`` and ``true`` are used for
false and true respectively.

Additional information can be added to the mappings. To avoid conflicts, any
non-AMD key names should be prefixed by "*vendor-name*.".

  .. table:: AMDHSA Code Object V2 Metadata Map
     :name: amdgpu-amdhsa-code-object-metadata-map-v2-table

     ========== ============== ========= =======================================
     String Key Value Type     Required? Description
     ========== ============== ========= =======================================
     "Version"  sequence of    Required  - The first integer is the major
                2 integers                 version. Currently 1.
                                         - The second integer is the minor
                                           version. Currently 0.
     "Printf"   sequence of              Each string is encoded information
                strings                  about a printf function call. The
                                         encoded information is organized as
                                         fields separated by colon (':'):

                                         ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``

                                         where:

                                         ``ID``
                                           A 32-bit integer as a unique id for
                                           each printf function call

                                         ``N``
                                           A 32-bit integer equal to the number
                                           of arguments of printf function call
                                           minus 1

                                         ``S[i]`` (where i = 0, 1, ... , N-1)
                                           32-bit integers for the size in bytes
                                           of the i-th FormatString argument of
                                           the printf function call

                                         ``FormatString``
                                           The format string passed to the
                                           printf function call.
     "Kernels"  sequence of    Required  Sequence of the mappings for each
                mapping                  kernel in the code object. See
                                         :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table`
                                         for the definition of the mapping.
     ========== ============== ========= =======================================

..

  .. table:: AMDHSA Code Object V2 Kernel Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table

     ================= ============== ========= ================================
     String Key        Value Type     Required? Description
     ================= ============== ========= ================================
     "Name"            string         Required  Source name of the kernel.
     "SymbolName"      string         Required  Name of the kernel
                                                descriptor ELF symbol.
     "Language"        string                   Source language of the kernel.
                                                Values include:

                                                - "OpenCL C"
                                                - "OpenCL C++"
                                                - "HCC"
                                                - "OpenMP"

     "LanguageVersion" sequence of              - The first integer is the major
                       2 integers                 version.
                                                - The second integer is the
                                                  minor version.
     "Attrs"           mapping                  Mapping of kernel attributes.
                                                See
                                                :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table`
                                                for the mapping definition.
     "Args"            sequence of              Sequence of mappings of the
                       mapping                  kernel arguments. See
                                                :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table`
                                                for the definition of the
                                                mapping.
     "CodeProps"       mapping                  Mapping of properties related to
                                                the kernel code. See
                                                :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table`
                                                for the mapping definition.
     ================= ============== ========= ================================

..

  .. table:: AMDHSA Code Object V2 Kernel Attribute Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table

     =================== ============== ========= ==============================
     String Key          Value Type     Required? Description
     =================== ============== ========= ==============================
     "ReqdWorkGroupSize" sequence of              If not 0, 0, 0 then all values
                         3 integers               must be >=1 and the dispatch
                                                  work-group size X, Y, Z must
                                                  correspond to the specified
                                                  values. Defaults to 0, 0, 0.

                                                  Corresponds to the OpenCL
                                                  ``reqd_work_group_size``
                                                  attribute.
     "WorkGroupSizeHint" sequence of              The dispatch work-group size
                         3 integers               X, Y, Z is likely to be the
                                                  specified values.

                                                  Corresponds to the OpenCL
                                                  ``work_group_size_hint``
                                                  attribute.
     "VecTypeHint"       string                   The name of a scalar or vector
                                                  type.

                                                  Corresponds to the OpenCL
                                                  ``vec_type_hint`` attribute.
     "RuntimeHandle"     string                   The external symbol name
                                                  associated with a kernel.
                                                  OpenCL runtime allocates a
                                                  global buffer for the symbol
                                                  and saves the kernel's address
                                                  to it, which is used for
                                                  device side enqueueing. Only
                                                  available for device side
                                                  enqueued kernels.
     =================== ============== ========= ==============================

..

  .. table:: AMDHSA Code Object V2 Kernel Argument Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table

     ================= ============== ========= ================================
     String Key        Value Type     Required? Description
     ================= ============== ========= ================================
     "Name"            string                   Kernel argument name.
     "TypeName"        string                   Kernel argument type name.
     "Size"            integer        Required  Kernel argument size in bytes.
     "Align"           integer        Required  Kernel argument alignment in
                                                bytes. Must be a power of two.
     "ValueKind"       string         Required  Kernel argument kind that
                                                specifies how to set up the
                                                corresponding argument.
                                                Values include:

                                                "ByValue"
                                                  The argument is copied
                                                  directly into the kernarg.

                                                "GlobalBuffer"
                                                  A global address space pointer
                                                  to the buffer data is passed
                                                  in the kernarg.

                                                "DynamicSharedPointer"
                                                  A group address space pointer
                                                  to dynamically allocated LDS
                                                  is passed in the kernarg.

                                                "Sampler"
                                                  A global address space
                                                  pointer to a S# is passed in
                                                  the kernarg.

                                                "Image"
                                                  A global address space
                                                  pointer to a T# is passed in
                                                  the kernarg.

                                                "Pipe"
                                                  A global address space pointer
                                                  to an OpenCL pipe is passed in
                                                  the kernarg.

                                                "Queue"
                                                  A global address space pointer
                                                  to an OpenCL device enqueue
                                                  queue is passed in the
                                                  kernarg.

                                                "HiddenGlobalOffsetX"
                                                  The OpenCL grid dispatch
                                                  global offset for the X
                                                  dimension is passed in the
                                                  kernarg.

                                                "HiddenGlobalOffsetY"
                                                  The OpenCL grid dispatch
                                                  global offset for the Y
                                                  dimension is passed in the
                                                  kernarg.

                                                "HiddenGlobalOffsetZ"
                                                  The OpenCL grid dispatch
                                                  global offset for the Z
                                                  dimension is passed in the
                                                  kernarg.

                                                "HiddenNone"
                                                  An argument that is not used
                                                  by the kernel. Space needs to
                                                  be left for it, but it does
                                                  not need to be set up.

                                                "HiddenPrintfBuffer"
                                                  A global address space pointer
                                                  to the runtime printf buffer
                                                  is passed in kernarg. Mutually
                                                  exclusive with
                                                  "HiddenHostcallBuffer".

                                                "HiddenHostcallBuffer"
                                                  A global address space pointer
                                                  to the runtime hostcall buffer
                                                  is passed in kernarg. Mutually
                                                  exclusive with
                                                  "HiddenPrintfBuffer".

                                                "HiddenDefaultQueue"
                                                  A global address space pointer
                                                  to the OpenCL device enqueue
                                                  queue that should be used by
                                                  the kernel by default is
                                                  passed in the kernarg.

                                                "HiddenCompletionAction"
                                                  A global address space pointer
                                                  to help link enqueued kernels
                                                  into the ancestor tree for
                                                  determining when the parent
                                                  kernel has finished.

                                                "HiddenMultiGridSyncArg"
                                                  A global address space pointer
                                                  for multi-grid synchronization
                                                  is passed in the kernarg.

     "ValueType"       string                   Unused and deprecated. This
                                                should no longer be emitted,
                                                but is accepted for
                                                compatibility.
     "PointeeAlign"    integer                  Alignment in bytes of pointee
                                                type for pointer type kernel
                                                argument. Must be a power
                                                of 2. Only present if
                                                "ValueKind" is
                                                "DynamicSharedPointer".
     "AddrSpaceQual"   string                   Kernel argument address space
                                                qualifier. Only present if
                                                "ValueKind" is "GlobalBuffer" or
                                                "DynamicSharedPointer". Values
                                                are:

                                                - "Private"
                                                - "Global"
                                                - "Constant"
                                                - "Local"
                                                - "Generic"
                                                - "Region"

                                                .. TODO::

                                                   Is GlobalBuffer only Global
                                                   or Constant? Is
                                                   DynamicSharedPointer always
                                                   Local? Can HCC allow Generic?
                                                   How can Private or Region
                                                   ever happen?

     "AccQual"         string                   Kernel argument access
                                                qualifier. Only present if
                                                "ValueKind" is "Image" or
                                                "Pipe". Values
                                                are:

                                                - "ReadOnly"
                                                - "WriteOnly"
                                                - "ReadWrite"

                                                .. TODO::

                                                   Does this apply to
                                                   GlobalBuffer?

     "ActualAccQual"   string                   The actual memory accesses
                                                performed by the kernel on the
                                                kernel argument. Only present if
                                                "ValueKind" is "GlobalBuffer",
                                                "Image", or "Pipe". This may be
                                                more restrictive than indicated
                                                by "AccQual" to reflect what the
                                                kernel actually does. If not
                                                present then the runtime must
                                                assume what is implied by
                                                "AccQual" and "IsConst". Values
                                                are:

                                                - "ReadOnly"
                                                - "WriteOnly"
                                                - "ReadWrite"

     "IsConst"         boolean                  Indicates if the kernel argument
                                                is const qualified. Only present
                                                if "ValueKind" is
                                                "GlobalBuffer".
     "IsRestrict"      boolean                  Indicates if the kernel argument
                                                is restrict qualified. Only
                                                present if "ValueKind" is
                                                "GlobalBuffer".
     "IsVolatile"      boolean                  Indicates if the kernel argument
                                                is volatile qualified. Only
                                                present if "ValueKind" is
                                                "GlobalBuffer".
     "IsPipe"          boolean                  Indicates if the kernel argument
                                                is pipe qualified. Only present
                                                if "ValueKind" is "Pipe".

                                                .. TODO::

                                                   Can GlobalBuffer be pipe
                                                   qualified?

     ================= ============== ========= ================================

..

  .. table:: AMDHSA Code Object V2 Kernel Code Properties Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table

     ============================ ============== ========= =====================
     String Key                   Value Type     Required? Description
     ============================ ============== ========= =====================
     "KernargSegmentSize"         integer        Required  The size in bytes of
                                                           the kernarg segment
                                                           that holds the values
                                                           of the arguments to
                                                           the kernel.
     "GroupSegmentFixedSize"      integer        Required  The amount of group
                                                           segment memory
                                                           required by a
                                                           work-group in
                                                           bytes. This does not
                                                           include any
                                                           dynamically allocated
                                                           group segment memory
                                                           that may be added
                                                           when the kernel is
                                                           dispatched.
     "PrivateSegmentFixedSize"    integer        Required  The amount of fixed
                                                           private address space
                                                           memory required for a
                                                           work-item in
                                                           bytes. If the kernel
                                                           uses a dynamic call
                                                           stack then additional
                                                           space must be added
                                                           to this value for the
                                                           call stack.
     "KernargSegmentAlign"        integer        Required  The maximum byte
                                                           alignment of
                                                           arguments in the
                                                           kernarg segment. Must
                                                           be a power of 2.
     "WavefrontSize"              integer        Required  Wavefront size. Must
                                                           be a power of 2.
     "NumSGPRs"                   integer        Required  Number of scalar
                                                           registers used by a
                                                           wavefront for
                                                           GFX6-GFX11. This
                                                           includes the special
                                                           SGPRs for VCC, Flat
                                                           Scratch (GFX7-GFX10)
                                                           and XNACK (for
                                                           GFX8-GFX10). It does
                                                           not include the 16
                                                           SGPR added if a trap
                                                           handler is
                                                           enabled. It is not
                                                           rounded up to the
                                                           allocation
                                                           granularity.
     "NumVGPRs"                   integer        Required  Number of vector
                                                           registers used by
                                                           each work-item for
                                                           GFX6-GFX11.
     "MaxFlatWorkGroupSize"       integer        Required  Maximum flat
                                                           work-group size
                                                           supported by the
                                                           kernel in work-items.
                                                           Must be >=1 and
                                                           consistent with
                                                           ReqdWorkGroupSize if
                                                           not 0, 0, 0.
     "NumSpilledSGPRs"            integer                  Number of stores from
                                                           a scalar register to
                                                           a register allocator
                                                           created spill
                                                           location.
     "NumSpilledVGPRs"            integer                  Number of stores from
                                                           a vector register to
                                                           a register allocator
                                                           created spill
                                                           location.
     ============================ ============== ========= =====================

.. _amdgpu-amdhsa-code-object-metadata-v3:

Code Object V3 Metadata
+++++++++++++++++++++++

.. warning::
  Code object V3 is not the default code object version emitted by this version
  of LLVM.

Code object V3 and above metadata is specified by the ``NT_AMDGPU_METADATA``
note record (see :ref:`amdgpu-note-records-v3-onwards`).

The metadata is represented as MessagePack formatted binary data (see
[MsgPack]_). The top level is a MessagePack map that includes the
keys defined in table
:ref:`amdgpu-amdhsa-code-object-metadata-map-table-v3` and referenced
tables.

Additional information can be added to the maps. To avoid conflicts,
any key names should be prefixed by "*vendor-name*." where
``vendor-name`` can be the name of the vendor and specific vendor
tool that generates the information. The prefix is abbreviated to
simply "." when it appears within a map that has been added by the
same *vendor-name*.

  .. table:: AMDHSA Code Object V3 Metadata Map
     :name: amdgpu-amdhsa-code-object-metadata-map-table-v3

     ================= ============== ========= =======================================
     String Key        Value Type     Required? Description
     ================= ============== ========= =======================================
     "amdhsa.version"  sequence of    Required  - The first integer is the major
                       2 integers                 version. Currently 1.
                                                - The second integer is the minor
                                                  version. Currently 0.
     "amdhsa.printf"   sequence of              Each string is encoded information
                       strings                  about a printf function call. The
                                                encoded information is organized as
                                                fields separated by colon (':'):

                                                ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``

                                                where:

                                                ``ID``
                                                  A 32-bit integer as a unique id for
                                                  each printf function call

                                                ``N``
                                                  A 32-bit integer equal to the number
                                                  of arguments of printf function call
                                                  minus 1

                                                ``S[i]`` (where i = 0, 1, ... , N-1)
                                                  32-bit integers for the size in bytes
                                                  of the i-th FormatString argument of
                                                  the printf function call

                                                ``FormatString``
                                                  The format string passed to the
                                                  printf function call.
     "amdhsa.kernels"  sequence of    Required  Sequence of the maps for each
                       map                      kernel in the code object. See
                                                :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3`
                                                for the definition of the keys included
                                                in that map.
     ================= ============== ========= =======================================

..

  .. table:: AMDHSA Code Object V3 Kernel Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3

     =================================== ============== ========= ================================
     String Key                          Value Type     Required? Description
     =================================== ============== ========= ================================
     ".name"                             string         Required  Source name of the kernel.
     ".symbol"                           string         Required  Name of the kernel
                                                                  descriptor ELF symbol.
     ".language"                         string                   Source language of the kernel.
                                                                  Values include:

                                                                  - "OpenCL C"
                                                                  - "OpenCL C++"
                                                                  - "HCC"
                                                                  - "HIP"
                                                                  - "OpenMP"
                                                                  - "Assembler"

     ".language_version"                 sequence of              - The first integer is the major
                                         2 integers                 version.
                                                                  - The second integer is the
                                                                    minor version.
     ".args"                             sequence of              Sequence of maps of the
                                         map                      kernel arguments. See
                                                                  :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`
                                                                  for the definition of the keys
                                                                  included in that map.
     ".reqd_workgroup_size"              sequence of              If not 0, 0, 0 then all values
                                         3 integers               must be >=1 and the dispatch
                                                                  work-group size X, Y, Z must
                                                                  correspond to the specified
                                                                  values. Defaults to 0, 0, 0.

                                                                  Corresponds to the OpenCL
                                                                  ``reqd_work_group_size``
                                                                  attribute.
     ".workgroup_size_hint"              sequence of              The dispatch work-group size
                                         3 integers               X, Y, Z is likely to be the
                                                                  specified values.

                                                                  Corresponds to the OpenCL
                                                                  ``work_group_size_hint``
                                                                  attribute.
     ".vec_type_hint"                    string                   The name of a scalar or vector
                                                                  type.

                                                                  Corresponds to the OpenCL
                                                                  ``vec_type_hint`` attribute.
     ".device_enqueue_symbol"            string                   The external symbol name
                                                                  associated with a kernel.
                                                                  OpenCL runtime allocates a
                                                                  global buffer for the symbol
                                                                  and saves the kernel's address
                                                                  to it, which is used for
                                                                  device side enqueueing. Only
                                                                  available for device side
                                                                  enqueued kernels.
     ".kernarg_segment_size"             integer        Required  The size in bytes of
                                                                  the kernarg segment
                                                                  that holds the values
                                                                  of the arguments to
                                                                  the kernel.
     ".group_segment_fixed_size"         integer        Required  The amount of group
                                                                  segment memory
                                                                  required by a
                                                                  work-group in
                                                                  bytes. This does not
                                                                  include any
                                                                  dynamically allocated
                                                                  group segment memory
                                                                  that may be added
                                                                  when the kernel is
                                                                  dispatched.
     ".private_segment_fixed_size"       integer        Required  The amount of fixed
                                                                  private address space
                                                                  memory required for a
                                                                  work-item in
                                                                  bytes. If the kernel
                                                                  uses a dynamic call
                                                                  stack then additional
                                                                  space must be added
                                                                  to this value for the
                                                                  call stack.
     ".kernarg_segment_align"            integer        Required  The maximum byte
                                                                  alignment of
                                                                  arguments in the
                                                                  kernarg segment. Must
                                                                  be a power of 2.
     ".wavefront_size"                   integer        Required  Wavefront size. Must
                                                                  be a power of 2.
     ".sgpr_count"                       integer        Required  Number of scalar
                                                                  registers required by a
                                                                  wavefront for
                                                                  GFX6-GFX9. A register
                                                                  is required if it is
                                                                  used explicitly, or
                                                                  if a higher numbered
                                                                  register is used
                                                                  explicitly. This
                                                                  includes the special
                                                                  SGPRs for VCC, Flat
                                                                  Scratch (GFX7-GFX9)
                                                                  and XNACK (for
                                                                  GFX8-GFX9). It does
                                                                  not include the 16
                                                                  SGPR added if a trap
                                                                  handler is
                                                                  enabled. It is not
                                                                  rounded up to the
                                                                  allocation
                                                                  granularity.
     ".vgpr_count"                       integer        Required  Number of vector
                                                                  registers required by
                                                                  each work-item for
                                                                  GFX6-GFX9. A register
                                                                  is required if it is
                                                                  used explicitly, or
                                                                  if a higher numbered
                                                                  register is used
                                                                  explicitly.
     ".agpr_count"                       integer        Required  Number of accumulator
                                                                  registers required by
                                                                  each work-item for
                                                                  GFX90A, GFX908.
     ".max_flat_workgroup_size"          integer        Required  Maximum flat
                                                                  work-group size
                                                                  supported by the
                                                                  kernel in work-items.
                                                                  Must be >=1 and
                                                                  consistent with
                                                                  ReqdWorkGroupSize if
                                                                  not 0, 0, 0.
     ".sgpr_spill_count"                 integer                  Number of stores from
                                                                  a scalar register to
                                                                  a register allocator
                                                                  created spill
                                                                  location.
     ".vgpr_spill_count"                 integer                  Number of stores from
                                                                  a vector register to
                                                                  a register allocator
                                                                  created spill
                                                                  location.
     ".kind"                             string                   The kind of the kernel
                                                                  with the following
                                                                  values:

                                                                  "normal"
                                                                    Regular kernels.

                                                                  "init"
                                                                    These kernels must be
                                                                    invoked after loading
                                                                    the containing code
                                                                    object and must
                                                                    complete before any
                                                                    normal and fini
                                                                    kernels in the same
                                                                    code object are
                                                                    invoked.

                                                                  "fini"
                                                                    These kernels must be
                                                                    invoked before
                                                                    unloading the
                                                                    containing code object
                                                                    and after all init and
                                                                    normal kernels in the
                                                                    same code object have
                                                                    been invoked and
                                                                    completed.

                                                                  If omitted, "normal" is
                                                                  assumed.
     =================================== ============== ========= ================================

..

  .. table:: AMDHSA Code Object V3 Kernel Argument Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3

     ====================== ============== ========= ================================
     String Key             Value Type     Required? Description
     ====================== ============== ========= ================================
     ".name"                string                   Kernel argument name.
     ".type_name"           string                   Kernel argument type name.
     ".size"                integer        Required  Kernel argument size in bytes.
     ".offset"              integer        Required  Kernel argument offset in
                                                     bytes. The offset must be a
                                                     multiple of the alignment
                                                     required by the argument.
     ".value_kind"          string         Required  Kernel argument kind that
                                                     specifies how to set up the
                                                     corresponding argument.
                                                     Values include:

                                                     "by_value"
                                                       The argument is copied
                                                       directly into the kernarg.

                                                     "global_buffer"
                                                       A global address space
                                                       pointer to the buffer data
                                                       is passed in the kernarg.

                                                     "dynamic_shared_pointer"
                                                       A group address space
                                                       pointer to dynamically
                                                       allocated LDS is passed in
                                                       the kernarg.

                                                     "sampler"
                                                       A global address space
                                                       pointer to a S# is passed
                                                       in the kernarg.

                                                     "image"
                                                       A global address space
                                                       pointer to a T# is passed
                                                       in the kernarg.

                                                     "pipe"
                                                       A global address space
                                                       pointer to an OpenCL pipe
                                                       is passed in the kernarg.

                                                     "queue"
                                                       A global address space
                                                       pointer to an OpenCL
                                                       device enqueue queue is
                                                       passed in the kernarg.

                                                     "hidden_global_offset_x"
                                                       The OpenCL grid dispatch
                                                       global offset for the X
                                                       dimension is passed in the
                                                       kernarg.

                                                     "hidden_global_offset_y"
                                                       The OpenCL grid dispatch
                                                       global offset for the Y
                                                       dimension is passed in the
                                                       kernarg.

                                                     "hidden_global_offset_z"
                                                       The OpenCL grid dispatch
                                                       global offset for the Z
                                                       dimension is passed in the
                                                       kernarg.

                                                     "hidden_none"
                                                       An argument that is not used
                                                       by the kernel. Space needs to
                                                       be left for it, but it does
                                                       not need to be set up.

                                                     "hidden_printf_buffer"
                                                       A global address space pointer
                                                       to the runtime printf buffer
                                                       is passed in kernarg. Mutually
                                                       exclusive with
                                                       "hidden_hostcall_buffer"
                                                       before Code Object V5.

                                                     "hidden_hostcall_buffer"
                                                       A global address space pointer
                                                       to the runtime hostcall buffer
                                                       is passed in kernarg. Mutually
                                                       exclusive with
                                                       "hidden_printf_buffer"
                                                       before Code Object V5.

                                                     "hidden_default_queue"
                                                       A global address space pointer
                                                       to the OpenCL device enqueue
                                                       queue that should be used by
                                                       the kernel by default is
                                                       passed in the kernarg.

                                                     "hidden_completion_action"
                                                       A global address space pointer
                                                       to help link enqueued kernels into
                                                       the ancestor tree for determining
                                                       when the parent kernel has finished.

                                                     "hidden_multigrid_sync_arg"
                                                       A global address space pointer for
                                                       multi-grid synchronization is
                                                       passed in the kernarg.

     ".value_type"          string                   Unused and deprecated. This
                                                     should no longer be emitted,
                                                     but is accepted for
                                                     compatibility.

     ".pointee_align"       integer                  Alignment in bytes of pointee
                                                     type for pointer type kernel
                                                     argument. Must be a power
                                                     of 2. Only present if
                                                     ".value_kind" is
                                                     "dynamic_shared_pointer".
     ".address_space"       string                   Kernel argument address space
                                                     qualifier. Only present if
                                                     ".value_kind" is "global_buffer" or
                                                     "dynamic_shared_pointer". Values
                                                     are:

                                                     - "private"
                                                     - "global"
                                                     - "constant"
                                                     - "local"
                                                     - "generic"
                                                     - "region"

                                                     .. TODO::

                                                        Is "global_buffer" only "global"
                                                        or "constant"? Is
                                                        "dynamic_shared_pointer" always
                                                        "local"? Can HCC allow "generic"?
                                                        How can "private" or "region"
                                                        ever happen?

     ".access"              string                   Kernel argument access
                                                     qualifier. Only present if
                                                     ".value_kind" is "image" or
                                                     "pipe". Values
                                                     are:

                                                     - "read_only"
                                                     - "write_only"
                                                     - "read_write"

                                                     .. TODO::

                                                        Does this apply to
                                                        "global_buffer"?

     ".actual_access"       string                   The actual memory accesses
                                                     performed by the kernel on the
                                                     kernel argument. Only present if
                                                     ".value_kind" is "global_buffer",
                                                     "image", or "pipe". This may be
                                                     more restrictive than indicated
                                                     by ".access" to reflect what the
                                                     kernel actually does. If not
                                                     present then the runtime must
                                                     assume what is implied by
                                                     ".access" and ".is_const". Values
                                                     are:

                                                     - "read_only"
                                                     - "write_only"
                                                     - "read_write"

     ".is_const"            boolean                  Indicates if the kernel argument
                                                     is const qualified. Only present
                                                     if ".value_kind" is
                                                     "global_buffer".

     ".is_restrict"         boolean                  Indicates if the kernel argument
                                                     is restrict qualified. Only
                                                     present if ".value_kind" is
                                                     "global_buffer".

     ".is_volatile"         boolean                  Indicates if the kernel argument
                                                     is volatile qualified. Only
                                                     present if ".value_kind" is
                                                     "global_buffer".

     ".is_pipe"             boolean                  Indicates if the kernel argument
                                                     is pipe qualified. Only present
                                                     if ".value_kind" is "pipe".

                                                     .. TODO::

                                                        Can "global_buffer" be pipe
                                                        qualified?

     ====================== ============== ========= ================================

..

.. _amdgpu-amdhsa-code-object-metadata-v4:

Code Object V4 Metadata
+++++++++++++++++++++++

Code object V4 metadata is the same as
:ref:`amdgpu-amdhsa-code-object-metadata-v3` with the changes and additions
defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v4`.

  .. table:: AMDHSA Code Object V4 Metadata Map Changes
     :name: amdgpu-amdhsa-code-object-metadata-map-table-v4

     ================= ============== ========= =======================================
     String Key        Value Type     Required? Description
     ================= ============== ========= =======================================
     "amdhsa.version"  sequence of    Required  - The first integer is the major
                       2 integers                 version. Currently 1.
                                                - The second integer is the minor
                                                  version. Currently 1.
     "amdhsa.target"   string         Required  The target name of the code using the syntax:

                                                .. code::

                                                  <target-triple> [ "-" <target-id> ]

                                                A canonical target ID must be
                                                used. See :ref:`amdgpu-target-triples`
                                                and :ref:`amdgpu-target-id`.
     ================= ============== ========= =======================================

.. _amdgpu-amdhsa-code-object-metadata-v5:

Code Object V5 Metadata
+++++++++++++++++++++++

.. warning::
  Code object V5 is not the default code object version emitted by this version
  of LLVM.


Code object V5 metadata is the same as
:ref:`amdgpu-amdhsa-code-object-metadata-v4` with the changes defined in table
:ref:`amdgpu-amdhsa-code-object-metadata-map-table-v5` and table
:ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v5`.

  .. table:: AMDHSA Code Object V5 Metadata Map Changes
     :name: amdgpu-amdhsa-code-object-metadata-map-table-v5

     ================= ============== ========= =======================================
     String Key        Value Type     Required? Description
     ================= ============== ========= =======================================
     "amdhsa.version"  sequence of    Required  - The first integer is the major
                       2 integers                 version. Currently 1.
                                                - The second integer is the minor
                                                  version. Currently 2.
     ================= ============== ========= =======================================

..

  .. table:: AMDHSA Code Object V5 Kernel Argument Metadata Map Additions and Changes
     :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v5

     ====================== ============== ========= ================================
     String Key             Value Type     Required? Description
     ====================== ============== ========= ================================
     ".value_kind"          string         Required  Kernel argument kind that
                                                     specifies how to set up the
                                                     corresponding argument. The
                                                     values are the same as code
                                                     object V3 metadata (see
                                                     :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`)
                                                     with the following additions:

                                                     "hidden_block_count_x"
                                                       The grid dispatch work-group count for the X dimension
                                                       is passed in the kernarg. Some languages, such as OpenCL,
                                                       support a last work-group in each dimension being partial.
                                                       This count only includes the non-partial work-group count.
                                                       This is not the same as the value in the AQL dispatch packet,
                                                       which has the grid size in work-items.

                                                     "hidden_block_count_y"
                                                       The grid dispatch work-group count for the Y dimension
                                                       is passed in the kernarg. Some languages, such as OpenCL,
                                                       support a last work-group in each dimension being partial.
                                                       This count only includes the non-partial work-group count.
                                                       This is not the same as the value in the AQL dispatch packet,
                                                       which has the grid size in work-items. If the grid dimensionality
                                                       is 1, then must be 1.

                                                     "hidden_block_count_z"
                                                       The grid dispatch work-group count for the Z dimension
                                                       is passed in the kernarg. Some languages, such as OpenCL,
                                                       support a last work-group in each dimension being partial.
                                                       This count only includes the non-partial work-group count.
                                                       This is not the same as the value in the AQL dispatch packet,
                                                       which has the grid size in work-items. If the grid dimensionality
                                                       is 1 or 2, then must be 1.

                                                     "hidden_group_size_x"
                                                       The grid dispatch work-group size for the X dimension is
                                                       passed in the kernarg. This size only applies to the
                                                       non-partial work-groups. This is the same value as the AQL
                                                       dispatch packet work-group size.

                                                     "hidden_group_size_y"
                                                       The grid dispatch work-group size for the Y dimension is
                                                       passed in the kernarg. This size only applies to the
                                                       non-partial work-groups. This is the same value as the AQL
                                                       dispatch packet work-group size. If the grid dimensionality
                                                       is 1, then must be 1.

                                                     "hidden_group_size_z"
                                                       The grid dispatch work-group size for the Z dimension is
                                                       passed in the kernarg. This size only applies to the
                                                       non-partial work-groups. This is the same value as the AQL
                                                       dispatch packet work-group size. If the grid dimensionality
                                                       is 1 or 2, then must be 1.

                                                     "hidden_remainder_x"
                                                       The grid dispatch work group size of the partial work group
                                                       of the X dimension, if it exists. Must be zero if a partial
                                                       work group does not exist in the X dimension.

                                                     "hidden_remainder_y"
                                                       The grid dispatch work group size of the partial work group
                                                       of the Y dimension, if it exists. Must be zero if a partial
                                                       work group does not exist in the Y dimension.

                                                     "hidden_remainder_z"
                                                       The grid dispatch work group size of the partial work group
                                                       of the Z dimension, if it exists. Must be zero if a partial
                                                       work group does not exist in the Z dimension.

                                                     "hidden_grid_dims"
                                                       The grid dispatch dimensionality. This is the same value
                                                       as the AQL dispatch packet dimensionality. Must be a value
                                                       between 1 and 3.

                                                     "hidden_heap_v1"
                                                       A global address space pointer to an initialized memory
                                                       buffer that conforms to the requirements of the malloc/free
                                                       device library V1 version implementation.

                                                     "hidden_private_base"
                                                       The high 32 bits of the flat addressing private aperture base.
                                                       Only used by GFX8 to allow conversion between private segment
                                                       and flat addresses. See :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.

                                                     "hidden_shared_base"
                                                       The high 32 bits of the flat addressing shared aperture base.
                                                       Only used by GFX8 to allow conversion between shared segment
                                                       and flat addresses. See :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.

                                                     "hidden_queue_ptr"
                                                       A global memory address space pointer to the ROCm runtime
                                                       ``struct amd_queue_t`` structure for the HSA queue of the
                                                       associated dispatch AQL packet. It is only required for pre-GFX9
                                                       devices for the trap handler ABI (see :ref:`amdgpu-amdhsa-trap-handler-abi`).

     ====================== ============== ========= ================================

..

Kernel Dispatch
~~~~~~~~~~~~~~~

The HSA architected queuing language (AQL) defines a user space memory interface
that can be used to control the dispatch of kernels, in an agent independent
way. An agent can have zero or more AQL queues created for it using an HSA
compatible runtime (see :ref:`amdgpu-os`), in which AQL packets (all of which
are 64 bytes) can be placed. See the *HSA Platform System Architecture
Specification* [HSA]_ for the AQL queue mechanics and packet layouts.

The packet processor of a kernel agent is responsible for detecting and
dispatching HSA kernels from the AQL queues associated with it.
For AMD GPUs the
packet processor is implemented by the hardware command processor (CP),
asynchronous dispatch controller (ADC) and shader processor input controller
(SPI).

An HSA compatible runtime can be used to allocate an AQL queue object. It uses
the kernel mode driver to initialize and register the AQL queue with CP.

To dispatch a kernel the following actions are performed. This can occur in the
CPU host program, or from an HSA kernel executing on a GPU.

1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
   executed is obtained.
2. A pointer to the kernel descriptor (see
   :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is obtained.
   It must be for a kernel that is contained in a code object that was loaded
   by an HSA compatible runtime on the kernel agent with which the AQL queue is
   associated.
3. Space is allocated for the kernel arguments using the HSA compatible runtime
   allocator for a memory region with the kernarg property for the kernel agent
   that will execute the kernel. It must be at least 16-byte aligned.
4. Kernel argument values are assigned to the kernel argument memory
   allocation. The layout is defined in the *HSA Programmer's Language
   Reference* [HSA]_. For AMDGPU the kernel execution directly accesses the
   kernel argument memory in the same way constant memory is accessed. (Note
   that the HSA specification allows an implementation to copy the kernel
   argument contents to another location that is accessed by the kernel.)
5. An AQL kernel dispatch packet is created on the AQL queue. The HSA compatible
   runtime API uses 64-bit atomic operations to reserve space in the AQL queue
   for the packet. The packet must be set up, and the final write must use an
   atomic store release to set the packet kind to ensure the packet contents are
   visible to the kernel agent.
   AQL defines a doorbell signal mechanism to notify the kernel agent that the
   AQL queue has been updated. These rules, and the layout of the AQL queue and
   kernel dispatch packet, are defined in the *HSA System Architecture
   Specification* [HSA]_.
6. A kernel dispatch packet includes information about the actual dispatch,
   such as grid and work-group size, together with information from the code
   object about the kernel, such as segment sizes. The HSA compatible runtime
   queries on the kernel symbol can be used to obtain the code object values
   which are recorded in the :ref:`amdgpu-amdhsa-code-object-metadata`.
7. CP executes micro-code and is responsible for detecting and setting up the
   GPU to execute the wavefronts of a kernel dispatch.
8. CP ensures that when a wavefront starts executing the kernel machine
   code, the scalar general purpose registers (SGPR) and vector general purpose
   registers (VGPR) are set up as required by the machine code. The required
   setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
   register state is defined in
   :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
9. The prolog of the kernel machine code (see
   :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
   before continuing executing the machine code that corresponds to the kernel.
10. When the kernel dispatch has completed execution, CP signals the completion
    signal specified in the kernel dispatch packet if not 0.

.. _amdgpu-amdhsa-memory-spaces:

Memory Spaces
~~~~~~~~~~~~~

The memory space properties are:

  .. table:: AMDHSA Memory Spaces
     :name: amdgpu-amdhsa-memory-spaces-table

     ================= =========== ======== ======= ==================
     Memory Space Name HSA Segment Hardware Address NULL Value
                       Name        Name     Size
     ================= =========== ======== ======= ==================
     Private           private     scratch  32      0x00000000
     Local             group       LDS      32      0xFFFFFFFF
     Global            global      global   64      0x0000000000000000
     Constant          constant    *same as 64      0x0000000000000000
                                   global*
     Generic           flat        flat     64      0x0000000000000000
     Region            N/A         GDS      32      *not implemented
                                                    for AMDHSA*
     ================= =========== ======== ======= ==================

The global and constant memory spaces both use global virtual addresses, which
are the same virtual address space used by the CPU. However, some virtual
addresses may only be accessible to the CPU, some only accessible by the GPU,
and some by both.

Using the constant memory space indicates that the data will not change during
the execution of the kernel. This allows scalar read instructions to be
used. The vector and scalar L1 caches are invalidated of volatile data before
each kernel dispatch execution to allow constant memory to change values between
kernel dispatches.

The local memory space uses the hardware Local Data Store (LDS) which is
automatically allocated when the hardware creates work-groups of wavefronts, and
freed when all the wavefronts of a work-group have terminated. The data store
(DS) instructions can be used to access it.

The private memory space uses the hardware scratch memory support. If the kernel
uses scratch, then the hardware allocates memory that is accessed using
wavefront lane dword (4 byte) interleaving.
The mapping used from private
address to physical address is:

  ``wavefront-scratch-base +
  (private-address * wavefront-size * 4) +
  (wavefront-lane-id * 4)``

There are different ways that the wavefront scratch base address is determined
by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
memory can be accessed in an interleaved manner using buffer instructions with
the scratch buffer descriptor and per wavefront scratch offset, by the scratch
instructions, or by flat instructions. If each lane of a wavefront accesses the
same private address, the interleaving results in adjacent dwords being accessed
and hence requires fewer cache lines to be fetched. Multi-dword access is not
supported except by flat and scratch instructions in GFX9-GFX11.

The generic address space uses the hardware flat address support available in
GFX7-GFX11. This uses two fixed ranges of virtual addresses (the private and
local apertures), that are outside the range of addressable global memory, to
map from a flat address to a private or local address.

FLAT instructions can take a flat address and access global, private (scratch)
and group (LDS) memory depending on if the address is within one of the
aperture ranges. Flat access to scratch requires hardware aperture setup and
setup in the kernel prologue (see
:ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat access to LDS requires
hardware aperture setup and M0 (GFX7-GFX8) register setup (see
:ref:`amdgpu-amdhsa-kernel-prolog-m0`).

To convert between a segment address and a flat address the base address of the
apertures can be used. For GFX7-GFX8 these are available in the
:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
For
GFX9-GFX11 the aperture base addresses are directly available as inline constant
registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
address mode the aperture sizes are 2^32 bytes and the base is aligned to 2^32
which makes it easier to convert from flat to segment or segment to flat.

Image and Samplers
~~~~~~~~~~~~~~~~~~

Image and sampler handles created by an HSA compatible runtime (see
:ref:`amdgpu-os`) are 64-bit addresses of a hardware 32-byte V# and 48-byte S#
object respectively. In order to support the HSA ``query_sampler`` operations
two extra dwords are used to store the HSA BRIG enumeration values for the
queries that are not trivially deducible from the S# representation.

HSA Signals
~~~~~~~~~~~

HSA signal handles created by an HSA compatible runtime (see :ref:`amdgpu-os`)
are 64-bit addresses of a structure allocated in memory accessible from both the
CPU and GPU. The structure is defined by the runtime and subject to change
between releases. For example, see [AMD-ROCm-github]_.

.. _amdgpu-amdhsa-hsa-aql-queue:

HSA AQL Queue
~~~~~~~~~~~~~

The HSA AQL queue structure is defined by an HSA compatible runtime (see
:ref:`amdgpu-os`) and subject to change between releases. For example, see
[AMD-ROCm-github]_. For some processors it contains fields needed to implement
certain language features such as the flat address aperture bases. It also
contains fields used by CP such as managing the allocation of scratch memory.

.. _amdgpu-amdhsa-kernel-descriptor:

Kernel Descriptor
~~~~~~~~~~~~~~~~~

A kernel descriptor consists of the information needed by CP to initiate the
execution of a kernel, including the entry point address of the machine code
that implements the kernel.
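As an illustration of the role the descriptor plays, the sketch below shows how a loader-style computation might derive a kernel's entry point from the descriptor's base address and the byte offset stored in the descriptor. This is a hypothetical helper, not part of any runtime; the 64-byte and 256-byte alignment requirements are those stated for the kernel descriptor and its ``KERNEL_CODE_ENTRY_BYTE_OFFSET`` field.

```python
def kernel_entry_address(descriptor_addr: int, entry_byte_offset: int) -> int:
    """Compute a kernel's entry point address from its kernel descriptor.

    descriptor_addr is the base address of the 64-byte kernel descriptor,
    which CP requires to be allocated on 64-byte alignment.
    entry_byte_offset is the (possibly negative) value of the descriptor's
    KERNEL_CODE_ENTRY_BYTE_OFFSET field; the resulting entry point
    instruction address must be 256-byte aligned.
    """
    assert descriptor_addr % 64 == 0, "kernel descriptor must be 64-byte aligned"
    entry = descriptor_addr + entry_byte_offset
    assert entry % 256 == 0, "kernel entry point must be 256-byte aligned"
    return entry
```

Note that the offset is signed: the machine code may be placed either after or before the descriptor in the loaded code object.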

Code Object V3 Kernel Descriptor
++++++++++++++++++++++++++++++++

CP microcode requires the Kernel descriptor to be allocated on 64-byte
alignment.

The fields used by CP for code objects before V3 also match those specified in
:ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.

  .. table:: Code Object V3 Kernel Descriptor
     :name: amdgpu-amdhsa-kernel-descriptor-v3-table

     ======= ======= =============================== ============================
     Bits    Size    Field Name                      Description
     ======= ======= =============================== ============================
     31:0    4 bytes GROUP_SEGMENT_FIXED_SIZE        The amount of fixed local
                                                     address space memory
                                                     required for a work-group
                                                     in bytes. This does not
                                                     include any dynamically
                                                     allocated local address
                                                     space memory that may be
                                                     added when the kernel is
                                                     dispatched.
     63:32   4 bytes PRIVATE_SEGMENT_FIXED_SIZE      The amount of fixed
                                                     private address space
                                                     memory required for a
                                                     work-item in bytes.
                                                     Additional space may need to
                                                     be added to this value if
                                                     the call stack has
                                                     non-inlined function calls.
     95:64   4 bytes KERNARG_SIZE                    The size of the kernarg
                                                     memory pointed to by the
                                                     AQL dispatch packet. The
                                                     kernarg memory is used to
                                                     pass arguments to the
                                                     kernel.

                                                     * If the kernarg pointer in
                                                       the dispatch packet is NULL
                                                       then there are no kernel
                                                       arguments.
                                                     * If the kernarg pointer in
                                                       the dispatch packet is
                                                       not NULL and this value
                                                       is 0 then the kernarg
                                                       memory size is
                                                       unspecified.
                                                     * If the kernarg pointer in
                                                       the dispatch packet is
                                                       not NULL and this value
                                                       is not 0 then the value
                                                       specifies the kernarg
                                                       memory size in bytes. It
                                                       is recommended to provide
                                                       a value as it may be used
                                                       by CP to optimize making
                                                       the kernarg memory
                                                       visible to the kernel
                                                       code.

     127:96  4 bytes Reserved, must be 0.
     191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET   Byte offset (possibly
                                                     negative) from base
                                                     address of kernel
                                                     descriptor to kernel's
                                                     entry point instruction
                                                     which must be 256 byte
                                                     aligned.
     351:192 20      Reserved, must be 0.
             bytes
     383:352 4 bytes COMPUTE_PGM_RSRC3               GFX6-GFX9
                                                       Reserved, must be 0.
                                                     GFX90A, GFX940
                                                       Compute Shader (CS)
                                                       program settings used by
                                                       CP to set up
                                                       ``COMPUTE_PGM_RSRC3``
                                                       configuration
                                                       register. See
                                                       :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
                                                     GFX10-GFX11
                                                       Compute Shader (CS)
                                                       program settings used by
                                                       CP to set up
                                                       ``COMPUTE_PGM_RSRC3``
                                                       configuration
                                                       register. See
                                                       :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-gfx11-table`.
     415:384 4 bytes COMPUTE_PGM_RSRC1               Compute Shader (CS)
                                                     program settings used by
                                                     CP to set up
                                                     ``COMPUTE_PGM_RSRC1``
                                                     configuration
                                                     register. See
                                                     :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
     447:416 4 bytes COMPUTE_PGM_RSRC2               Compute Shader (CS)
                                                     program settings used by
                                                     CP to set up
                                                     ``COMPUTE_PGM_RSRC2``
                                                     configuration
                                                     register. See
                                                     :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
     458:448 7 bits  *See separate bits below.*      Enable the setup of the
                                                     SGPR user data registers
                                                     (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     The total number of SGPR
                                                     user data registers
                                                     requested must not exceed
                                                     16 and match value in
                                                     ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
                                                     Any requests beyond 16
                                                     will be ignored.
     >448    1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     If the *Target Properties*
                     _BUFFER                         column of
                                                     :ref:`amdgpu-processor-table`
                                                     specifies *Architected flat
                                                     scratch* then not supported
                                                     and must be 0.
     >449    1 bit   ENABLE_SGPR_DISPATCH_PTR
     >450    1 bit   ENABLE_SGPR_QUEUE_PTR
     >451    1 bit   ENABLE_SGPR_KERNARG_SEGMENT_PTR
     >452    1 bit   ENABLE_SGPR_DISPATCH_ID
     >453    1 bit   ENABLE_SGPR_FLAT_SCRATCH_INIT   If the *Target Properties*
                                                     column of
                                                     :ref:`amdgpu-processor-table`
                                                     specifies *Architected flat
                                                     scratch* then not supported
                                                     and must be 0.
     >454    1 bit   ENABLE_SGPR_PRIVATE_SEGMENT
                     _SIZE
     457:455 3 bits  Reserved, must be 0.
     458     1 bit   ENABLE_WAVEFRONT_SIZE32         GFX6-GFX9
                                                       Reserved, must be 0.
                                                     GFX10-GFX11
                                                       - If 0 execute in
                                                         wavefront size 64 mode.
                                                       - If 1 execute in
                                                         native wavefront size
                                                         32 mode.
     463:459 5 bits  Reserved, must be 0.
     464     1 bit   RESERVED_464                    Deprecated, must be 0.
     467:465 3 bits  Reserved, must be 0.
     468     1 bit   RESERVED_468                    Deprecated, must be 0.
     471:469 3 bits  Reserved, must be 0.
     511:472 5 bytes Reserved, must be 0.
     512     **Total size 64 bytes.**
     ======= ====================================================================

..

  .. table:: compute_pgm_rsrc1 for GFX6-GFX11
     :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table

     ======= ======= =============================== ===========================================================================
     Bits    Size    Field Name                      Description
     ======= ======= =============================== ===========================================================================
     5:0     6 bits  GRANULATED_WORKITEM_VGPR_COUNT  Number of vector register
                                                     blocks used by each work-item;
                                                     granularity is device
                                                     specific:

                                                     GFX6-GFX9
                                                       - vgprs_used 0..256
                                                       - max(0, ceil(vgprs_used / 4) - 1)
                                                     GFX90A, GFX940
                                                       - vgprs_used 0..512
                                                       - vgprs_used = align(arch_vgprs, 4)
                                                         + acc_vgprs
                                                       - max(0, ceil(vgprs_used / 8) - 1)
                                                     GFX10-GFX11 (wavefront size 64)
                                                       - max_vgpr 1..256
                                                       - max(0, ceil(vgprs_used / 4) - 1)
                                                     GFX10-GFX11 (wavefront size 32)
                                                       - max_vgpr 1..256
                                                       - max(0, ceil(vgprs_used / 8) - 1)

                                                     Where vgprs_used is defined
                                                     as the highest VGPR number
                                                     explicitly referenced plus
                                                     one.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.VGPRS``.

                                                     The
                                                     :ref:`amdgpu-assembler`
                                                     calculates this
                                                     automatically for the
                                                     selected processor from
                                                     values provided to the
                                                     `.amdhsa_kernel` directive
                                                     by the
                                                     `.amdhsa_next_free_vgpr`
                                                     nested directive (see
                                                     :ref:`amdhsa-kernel-directives-table`).
     9:6     4 bits  GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar register
                                                     blocks used by a wavefront;
                                                     granularity is device
                                                     specific:

                                                     GFX6-GFX8
                                                       - sgprs_used 0..112
                                                       - max(0, ceil(sgprs_used / 8) - 1)
                                                     GFX9
                                                       - sgprs_used 0..112
                                                       - 2 * max(0, ceil(sgprs_used / 16) - 1)
                                                     GFX10-GFX11
                                                       Reserved, must be 0.
                                                       (128 SGPRs always
                                                       allocated.)

                                                     Where sgprs_used is
                                                     defined as the highest
                                                     SGPR number explicitly
                                                     referenced plus one, plus
                                                     a target specific number
                                                     of additional special
                                                     SGPRs for VCC,
                                                     FLAT_SCRATCH (GFX7+) and
                                                     XNACK_MASK (GFX8+), and
                                                     any additional
                                                     target specific
                                                     limitations. It does not
                                                     include the 16 SGPRs added
                                                     if a trap handler is
                                                     enabled.

                                                     The target specific
                                                     limitations and special
                                                     SGPR layout are defined in
                                                     the hardware
                                                     documentation, which can
                                                     be found in the
                                                     :ref:`amdgpu-processors`
                                                     table.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.SGPRS``.

                                                     The
                                                     :ref:`amdgpu-assembler`
                                                     calculates this
                                                     automatically for the
                                                     selected processor from
                                                     values provided to the
                                                     `.amdhsa_kernel` directive
                                                     by the
                                                     `.amdhsa_next_free_sgpr`
                                                     and `.amdhsa_reserve_*`
                                                     nested directives (see
                                                     :ref:`amdhsa-kernel-directives-table`).
     11:10   2 bits  PRIORITY                        Must be 0.

                                                     Start executing wavefront
                                                     at the specified priority.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.PRIORITY``.
     13:12   2 bits  FLOAT_ROUND_MODE_32             Wavefront starts execution
                                                     with specified rounding
                                                     mode for single (32
                                                     bit) floating point
                                                     precision floating point
                                                     operations.

                                                     Floating point rounding
                                                     mode values are defined in
                                                     :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     15:14   2 bits  FLOAT_ROUND_MODE_16_64          Wavefront starts execution
                                                     with specified rounding
                                                     mode for half/double (16
                                                     and 64-bit) floating point
                                                     precision floating point
                                                     operations.

                                                     Floating point rounding
                                                     mode values are defined in
                                                     :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     17:16   2 bits  FLOAT_DENORM_MODE_32            Wavefront starts execution
                                                     with specified denorm mode
                                                     for single (32
                                                     bit) floating point
                                                     precision floating point
                                                     operations.

                                                     Floating point denorm mode
                                                     values are defined in
                                                     :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     19:18   2 bits  FLOAT_DENORM_MODE_16_64         Wavefront starts execution
                                                     with specified denorm mode
                                                     for half/double (16
                                                     and 64-bit) floating point
                                                     precision floating point
                                                     operations.

                                                     Floating point denorm mode
                                                     values are defined in
                                                     :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     20      1 bit   PRIV                            Must be 0.

                                                     Start executing wavefront
                                                     in privilege trap handler
                                                     mode.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.PRIV``.
     21      1 bit   ENABLE_DX10_CLAMP               Wavefront starts execution
                                                     with DX10 clamp mode
                                                     enabled. Used by the vector
                                                     ALU to force DX10 style
                                                     treatment of NaN's (when
                                                     set, clamp NaN to zero,
                                                     otherwise pass NaN
                                                     through).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
     22      1 bit   DEBUG_MODE                      Must be 0.

                                                     Start executing wavefront
                                                     in single step mode.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
     23      1 bit   ENABLE_IEEE_MODE                Wavefront starts execution
                                                     with IEEE mode
                                                     enabled. Floating point
                                                     opcodes that support
                                                     exception flag gathering
                                                     will quiet and propagate
                                                     signaling-NaN inputs per
                                                     IEEE 754-2008. Min_dx10 and
                                                     max_dx10 become IEEE
                                                     754-2008 compliant due to
                                                     signaling-NaN propagation
                                                     and quieting.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
     24      1 bit   BULKY                           Must be 0.

                                                     Only one work-group allowed
                                                     to execute on a compute
                                                     unit.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.BULKY``.
     25      1 bit   CDBG_USER                       Must be 0.

                                                     Flag that can be used to
                                                     control debugging code.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.CDBG_USER``.
     26      1 bit   FP16_OVFL                       GFX6-GFX8
                                                       Reserved, must be 0.
                                                     GFX9-GFX11
                                                       Wavefront starts execution
                                                       with specified fp16 overflow
                                                       mode.

                                                       - If 0, fp16 overflow generates
                                                         +/-INF values.
                                                       - If 1, fp16 overflow that is the
                                                         result of a +/-INF input value
                                                         or divide by 0 produces a +/-INF,
                                                         otherwise clamps computed
                                                         overflow to +/-MAX_FP16 as
                                                         appropriate.

                                                       Used by CP to set up
                                                       ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
     28:27   2 bits                                  Reserved, must be 0.
     29      1 bit   WGP_MODE                        GFX6-GFX9
                                                       Reserved, must be 0.
                                                     GFX10-GFX11
                                                       - If 0 execute work-groups in
                                                         CU wavefront execution mode.
                                                       - If 1 execute work-groups
                                                         in WGP wavefront execution mode.

                                                       See :ref:`amdgpu-amdhsa-memory-model`.

                                                       Used by CP to set up
                                                       ``COMPUTE_PGM_RSRC1.WGP_MODE``.
     30      1 bit   MEM_ORDERED                     GFX6-GFX9
                                                       Reserved, must be 0.
                                                     GFX10-GFX11
                                                       Controls the behavior of the
                                                       s_waitcnt's vmcnt and vscnt
                                                       counters.

                                                       - If 0 vmcnt reports completion
                                                         of load and atomic with return
                                                         out of order with sample
                                                         instructions, and the vscnt
                                                         reports the completion of
                                                         store and atomic without
                                                         return in order.
                                                       - If 1 vmcnt reports completion
                                                         of load, atomic with return
                                                         and sample instructions in
                                                         order, and the vscnt reports
                                                         the completion of store and
                                                         atomic without return in order.

                                                       Used by CP to set up
                                                       ``COMPUTE_PGM_RSRC1.MEM_ORDERED``.
     31      1 bit   FWD_PROGRESS                    GFX6-GFX9
                                                       Reserved, must be 0.
                                                     GFX10-GFX11
                                                       - If 0 execute SIMD wavefronts
                                                         using oldest first policy.
                                                       - If 1 execute SIMD wavefronts to
                                                         ensure wavefronts will make some
                                                         forward progress.

                                                       Used by CP to set up
                                                       ``COMPUTE_PGM_RSRC1.FWD_PROGRESS``.
     32              **Total size 4 bytes**
     ======= ===================================================================================================================

..

  .. table:: compute_pgm_rsrc2 for GFX6-GFX11
     :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table

     ======= ======= =============================== ===========================================================================
     Bits    Size    Field Name                      Description
     ======= ======= =============================== ===========================================================================
     0       1 bit   ENABLE_PRIVATE_SEGMENT          * Enable the setup of the
                                                       private segment.
                                                     * If the *Target Properties*
                                                       column of
                                                       :ref:`amdgpu-processor-table`
                                                       does not specify
                                                       *Architected flat
                                                       scratch* then enable the
                                                       setup of the SGPR
                                                       wavefront scratch offset
                                                       system register (see
                                                       :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
                                                     * If the *Target Properties*
                                                       column of
                                                       :ref:`amdgpu-processor-table`
                                                       specifies *Architected
                                                       flat scratch* then enable
                                                       the setup of the
                                                       FLAT_SCRATCH register
                                                       pair (see
                                                       :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
     5:1     5 bits  USER_SGPR_COUNT                 The total number of SGPR
                                                     user data
                                                     registers requested. This
                                                     number must be greater than
                                                     or equal to the number of user
                                                     data registers enabled.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.USER_SGPR``.
     6       1 bit   ENABLE_TRAP_HANDLER             Must be 0.
4342 4343 This bit represents 4344 ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``, 4345 which is set by the CP if 4346 the runtime has installed a 4347 trap handler. 4348 7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the 4349 system SGPR register for 4350 the work-group id in the X 4351 dimension (see 4352 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 4353 4354 Used by CP to set up 4355 ``COMPUTE_PGM_RSRC2.TGID_X_EN``. 4356 8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the 4357 system SGPR register for 4358 the work-group id in the Y 4359 dimension (see 4360 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 4361 4362 Used by CP to set up 4363 ``COMPUTE_PGM_RSRC2.TGID_Y_EN``. 4364 9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the 4365 system SGPR register for 4366 the work-group id in the Z 4367 dimension (see 4368 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 4369 4370 Used by CP to set up 4371 ``COMPUTE_PGM_RSRC2.TGID_Z_EN``. 4372 10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the 4373 system SGPR register for 4374 work-group information (see 4375 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 4376 4377 Used by CP to set up 4378 ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``. 4379 12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the 4380 VGPR system registers used 4381 for the work-item ID. 4382 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table` 4383 defines the values. 4384 4385 Used by CP to set up 4386 ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``. 4387 13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0. 4388 4389 Wavefront starts execution 4390 with address watch 4391 exceptions enabled which 4392 are generated when L1 has 4393 witnessed a thread access 4394 an *address of 4395 interest*. 4396 4397 CP is responsible for 4398 filling in the address 4399 watch bit in 4400 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB`` 4401 according to what the 4402 runtime requests. 4403 14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0. 
4404 4405 Wavefront starts execution 4406 with memory violation 4407 exceptions 4408 enabled which are generated 4409 when a memory violation has 4410 occurred for this wavefront from 4411 L1 or LDS 4412 (write-to-read-only-memory, 4413 mis-aligned atomic, LDS 4414 address out of range, 4415 illegal address, etc.). 4416 4417 CP sets the memory 4418 violation bit in 4419 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB`` 4420 according to what the 4421 runtime requests. 4422 23:15 9 bits GRANULATED_LDS_SIZE Must be 0. 4423 4424 CP uses the rounded value 4425 from the dispatch packet, 4426 not this value, as the 4427 dispatch may contain 4428 dynamically allocated group 4429 segment memory. CP writes 4430 directly to 4431 ``COMPUTE_PGM_RSRC2.LDS_SIZE``. 4432 4433 Amount of group segment 4434 (LDS) to allocate for each 4435 work-group. Granularity is 4436 device specific: 4437 4438 GFX6 4439 roundup(lds-size / (64 * 4)) 4440 GFX7-GFX11 4441 roundup(lds-size / (128 * 4)) 4442 4443 24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution 4444 _INVALID_OPERATION with specified exceptions 4445 enabled. 4446 4447 Used by CP to set up 4448 ``COMPUTE_PGM_RSRC2.EXCP_EN`` 4449 (set from bits 0..6). 4450 4451 IEEE 754 FP Invalid 4452 Operation 4453 25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more 4454 _SOURCE input operands is a 4455 denormal number 4456 26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by 4457 _DIVISION_BY_ZERO Zero 4458 27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Overflow 4459 _OVERFLOW 4460 28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow 4461 _UNDERFLOW 4462 29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact 4463 _INEXACT 4464 30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero 4465 _ZERO (rcp_iflag_f32 instruction 4466 only) 4467 31 1 bit Reserved, must be 0.
4468 32 **Total size 4 bytes.** 4469 ======= =================================================================================================================== 4470 4471.. 4472 4473 .. table:: compute_pgm_rsrc3 for GFX90A, GFX940 4474 :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table 4475 4476 ======= ======= =============================== =========================================================================== 4477 Bits Size Field Name Description 4478 ======= ======= =============================== =========================================================================== 4479 5:0 6 bits ACCUM_OFFSET Offset of the first AccVGPR in the unified register file. Granularity 4. 4480 Value 0-63. 0 - accum-offset = 4, 1 - accum-offset = 8, ..., 4481 63 - accum-offset = 256. 4482 15:6 10 Reserved, must be 0. 4483 bits 4484 16 1 bit TG_SPLIT - If 0 the waves of a work-group are 4485 launched in the same CU. 4486 - If 1 the waves of a work-group can be 4487 launched in different CUs. The waves 4488 cannot use S_BARRIER or LDS. 4489 31:17 15 Reserved, must be 0. 4490 bits 4491 32 **Total size 4 bytes.** 4492 ======= =================================================================================================================== 4493 4494.. 4495 4496 .. table:: compute_pgm_rsrc3 for GFX10-GFX11 4497 :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-gfx11-table 4498 4499 ======= ======= =============================== =========================================================================== 4500 Bits Size Field Name Description 4501 ======= ======= =============================== =========================================================================== 4502 3:0 4 bits SHARED_VGPR_COUNT Number of shared VGPR blocks when executing in subvector mode. For 4503 wavefront size 64 the value is 0-15, representing 0-120 VGPRs (granularity 4504 of 8), such that (compute_pgm_rsrc1.vgprs +1)*4 + shared_vgpr_count*8 does 4505 not exceed 256.
For wavefront size 32 shared_vgpr_count must be 0. 4506 9:4 6 bits INST_PREF_SIZE GFX10 4507 Reserved, must be 0. 4508 GFX11 4509 Number of instruction bytes to prefetch, starting at the kernel's entry 4510 point instruction, before wavefront starts execution. The value is 0..63 4511 with a granularity of 128 bytes. 4512 10 1 bit TRAP_ON_START GFX10 4513 Reserved, must be 0. 4514 GFX11 4515 Must be 0. 4516 4517 If 1, wavefront starts execution by trapping into the trap handler. 4518 4519 CP is responsible for filling in the trap on start bit in 4520 ``COMPUTE_PGM_RSRC3.TRAP_ON_START`` according to what the runtime 4521 requests. 4522 11 1 bit TRAP_ON_END GFX10 4523 Reserved, must be 0. 4524 GFX11 4525 Must be 0. 4526 4527 If 1, wavefront execution terminates by trapping into the trap handler. 4528 4529 CP is responsible for filling in the trap on end bit in 4530 ``COMPUTE_PGM_RSRC3.TRAP_ON_END`` according to what the runtime requests. 4531 30:12 19 bits Reserved, must be 0. 4532 31 1 bit IMAGE_OP GFX10 4533 Reserved, must be 0. 4534 GFX11 4535 If 1, the kernel execution contains image instructions. If executed as 4536 part of a graphics pipeline, image read instructions will stall waiting 4537 for any necessary ``WAIT_SYNC`` fence to be performed in order to 4538 indicate that earlier pipeline stages have completed writing to the 4539 image. 4540 4541 Not used for compute kernels that are not part of a graphics pipeline and 4542 must be 0. 4543 32 **Total size 4 bytes.** 4544 ======= =================================================================================================================== 4545 4546.. 4547 4548 .. 
table:: Floating Point Rounding Mode Enumeration Values 4549 :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table 4550 4551 ====================================== ===== ============================== 4552 Enumeration Name Value Description 4553 ====================================== ===== ============================== 4554 FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even 4555 FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity 4556 FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity 4557 FLOAT_ROUND_MODE_ZERO 3 Round Toward 0 4558 ====================================== ===== ============================== 4559 4560.. 4561 4562 .. table:: Floating Point Denorm Mode Enumeration Values 4563 :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table 4564 4565 ====================================== ===== ============================== 4566 Enumeration Name Value Description 4567 ====================================== ===== ============================== 4568 FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination 4569 Denorms 4570 FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms 4571 FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms 4572 FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush 4573 ====================================== ===== ============================== 4574 4575.. 4576 4577 .. table:: System VGPR Work-Item ID Enumeration Values 4578 :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table 4579 4580 ======================================== ===== ============================ 4581 Enumeration Name Value Description 4582 ======================================== ===== ============================ 4583 SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension 4584 ID. 4585 SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y 4586 dimensions ID. 4587 SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z 4588 dimensions ID. 4589 SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined. 
4590 ======================================== ===== ============================ 4591 4592.. _amdgpu-amdhsa-initial-kernel-execution-state: 4593 4594Initial Kernel Execution State 4595~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 4596 4597This section defines the register state that will be set up by the packet 4598processor prior to the start of execution of every wavefront. This is limited by 4599the constraints of the hardware controllers of CP/ADC/SPI. 4600 4601The order of the SGPR registers is defined, but the compiler can specify which 4602ones are actually set up in the kernel descriptor using the ``enable_sgpr_*`` bit 4603fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used 4604for enabled registers are dense starting at SGPR0: the first enabled register is 4605SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have 4606an SGPR number. 4607 4608The initial SGPRs comprise up to 16 User SGPRs that are set by CP and apply to 4609all wavefronts of the grid. It is possible to specify more than 16 User SGPRs 4610using the ``enable_sgpr_*`` bit fields, in which case only the first 16 are 4611actually initialized. These are then immediately followed by the System SGPRs 4612that are set up by ADC/SPI and can have different values for each wavefront of 4613the grid dispatch. 4614 4615SGPR register initial state is defined in 4616:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`. 4617 4618 .. table:: SGPR Register Set Up Order 4619 :name: amdgpu-amdhsa-sgpr-register-set-up-order-table 4620 4621 ========== ========================== ====== ============================== 4622 SGPR Order Name Number Description 4623 (kernel descriptor enable of 4624 field) SGPRs 4625 ========== ========================== ====== ============================== 4626 First Private Segment Buffer 4 See 4627 (enable_sgpr_private :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
4628 _segment_buffer) 4629 then Dispatch Ptr 2 64-bit address of AQL dispatch 4630 (enable_sgpr_dispatch_ptr) packet for kernel dispatch 4631 actually executing. 4632 then Queue Ptr 2 64-bit address of amd_queue_t 4633 (enable_sgpr_queue_ptr) object for AQL queue on which 4634 the dispatch packet was 4635 queued. 4636 then Kernarg Segment Ptr 2 64-bit address of Kernarg 4637 (enable_sgpr_kernarg segment. This is directly 4638 _segment_ptr) copied from the 4639 kernarg_address in the kernel 4640 dispatch packet. 4641 4642 Having CP load it once avoids 4643 loading it at the beginning of 4644 every wavefront. 4645 then Dispatch Id 2 64-bit Dispatch ID of the 4646 (enable_sgpr_dispatch_id) dispatch packet being 4647 executed. 4648 then Flat Scratch Init 2 See 4649 (enable_sgpr_flat_scratch :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`. 4650 _init) 4651 then Private Segment Size 1 The 32-bit byte size of a 4652 (enable_sgpr_private single work-item's memory 4653 _segment_size) allocation. This is the 4654 value from the kernel 4655 dispatch packet Private 4656 Segment Byte Size rounded up 4657 by CP to a multiple of 4658 DWORD. 4659 4660 Having CP load it once avoids 4661 loading it at the beginning of 4662 every wavefront. 4663 4664 This is not used for 4665 GFX7-GFX8 since it is the same 4666 value as the second SGPR of 4667 Flat Scratch Init. However, it 4668 may be needed for GFX9-GFX11 which 4669 changes the meaning of the 4670 Flat Scratch Init value. 4671 then Work-Group Id X 1 32-bit work-group id in X 4672 (enable_sgpr_workgroup_id dimension of grid for 4673 _X) wavefront. 4674 then Work-Group Id Y 1 32-bit work-group id in Y 4675 (enable_sgpr_workgroup_id dimension of grid for 4676 _Y) wavefront. 4677 then Work-Group Id Z 1 32-bit work-group id in Z 4678 (enable_sgpr_workgroup_id dimension of grid for 4679 _Z) wavefront. 
4680 then Work-Group Info 1 {first_wavefront, 14'b0000, 4681 (enable_sgpr_workgroup ordered_append_term[10:0], 4682 _info) threadgroup_size_in_wavefronts[5:0]} 4683 then Scratch Wavefront Offset 1 See 4684 (enable_sgpr_private :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch` 4685 _segment_wavefront_offset) and 4686 :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`. 4687 ========== ========================== ====== ============================== 4688 4689The order of the VGPR registers is defined, but the compiler can specify which 4690ones are actually set up in the kernel descriptor using the ``enable_vgpr*`` bit 4691fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used 4692for enabled registers are dense starting at VGPR0: the first enabled register is 4693VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a 4694VGPR number. 4695 4696There are different methods used for the VGPR initial state: 4697 4698* Unless the *Target Properties* column of :ref:`amdgpu-processor-table` 4699 specifies otherwise, a separate VGPR register is used per work-item ID. The 4700 VGPR register initial state for this method is defined in 4701 :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table`. 4702* If the *Target Properties* column of :ref:`amdgpu-processor-table` 4703 specifies *Packed work-item IDs*, the initial value of the VGPR0 register is used 4704 for all work-item IDs. The register layout for this method is defined in 4705 :ref:`amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table`. 4706 4707 ..
table:: VGPR Register Set Up Order for Unpacked Work-Item ID Method 4708 :name: amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table 4709 4710 ========== ========================== ====== ============================== 4711 VGPR Order Name Number Description 4712 (kernel descriptor enable of 4713 field) VGPRs 4714 ========== ========================== ====== ============================== 4715 First Work-Item Id X 1 32-bit work-item id in X 4716 (Always initialized) dimension of work-group for 4717 wavefront lane. 4718 then Work-Item Id Y 1 32-bit work-item id in Y 4719 (enable_vgpr_workitem_id dimension of work-group for 4720 > 0) wavefront lane. 4721 then Work-Item Id Z 1 32-bit work-item id in Z 4722 (enable_vgpr_workitem_id dimension of work-group for 4723 > 1) wavefront lane. 4724 ========== ========================== ====== ============================== 4725 4726.. 4727 4728 .. table:: Register Layout for Packed Work-Item ID Method 4729 :name: amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table 4730 4731 ======= ======= ================ ========================================= 4732 Bits Size Field Name Description 4733 ======= ======= ================ ========================================= 4734 0:9 10 bits Work-Item Id X Work-item id in X 4735 dimension of work-group for 4736 wavefront lane. 4737 4738 Always initialized. 4739 4740 10:19 10 bits Work-Item Id Y Work-item id in Y 4741 dimension of work-group for 4742 wavefront lane. 4743 4744 Initialized if enable_vgpr_workitem_id > 4745 0, otherwise set to 0. 4746 20:29 10 bits Work-Item Id Z Work-item id in Z 4747 dimension of work-group for 4748 wavefront lane. 4749 4750 Initialized if enable_vgpr_workitem_id > 4751 1, otherwise set to 0. 4752 30:31 2 bits Reserved, set to 0. 4753 ======= ======= ================ ========================================= 4754 4755The setting of registers is done by GPU CP/ADC/SPI hardware as follows: 4756 47571. 
SGPRs before the Work-Group Ids are set by CP using the 16 User Data 4758 registers. 47592. Work-group Id registers X, Y, Z are set by ADC which supports any 4760 combination including none. 47613. Scratch Wavefront Offset is set by SPI on a per wavefront basis which is why 4762 its value cannot be included with the flat scratch init value which is per 4763 queue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). 47644. The VGPRs are set by SPI which only supports specifying either (X), (X, Y) 4765 or (X, Y, Z). 47665. Flat Scratch register pair initialization is described in 4767 :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`. 4768 4769The global segment can be accessed either using buffer instructions (GFX6 which 4770has V# 64-bit address support), flat instructions (GFX7-GFX11), or global 4771instructions (GFX9-GFX11). 4772 4773If buffer operations are used, then the compiler can generate a V# with the 4774following properties: 4775 4776* base address of 0 4777* no swizzle 4778* ATC: 1 if IOMMU present (such as APU) 4779* ptr64: 1 4780* MTYPE set to support memory coherence that matches the runtime (such as CC for 4781 APU and NC for dGPU). 4782 4783.. _amdgpu-amdhsa-kernel-prolog: 4784 4785Kernel Prolog 4786~~~~~~~~~~~~~ 4787 4788The compiler performs initialization in the kernel prologue depending on the 4789target and information such as stack usage in the kernel and called 4790functions. Some of this initialization requires the compiler to request certain 4791User and System SGPRs be present in the 4792:ref:`amdgpu-amdhsa-initial-kernel-execution-state` via the 4793:ref:`amdgpu-amdhsa-kernel-descriptor`. 4794 4795.. _amdgpu-amdhsa-kernel-prolog-cfi: 4796 4797CFI 4798+++ 4799 48001. The CFI return address is undefined. 4801 48022. The CFI CFA is defined using an expression which evaluates to a location 4803 description that comprises one memory location description for the 4804 ``DW_ASPACE_AMDGPU_private_lane`` address space address ``0``.
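The prolog code relies on the dense SGPR numbering defined in the initial kernel execution state above. As a rough illustration of that numbering, here is a minimal sketch; the helper and register names are hypothetical, not LLVM's implementation, and the counts come from the SGPR Register Set Up Order table:

```python
# Sketch (hypothetical helper, not LLVM code) of the dense SGPR numbering
# described in the SGPR Register Set Up Order table: enabled registers are
# packed starting at SGPR0 in table order; disabled registers get no number.

# (name, number of SGPRs) in kernel-descriptor order, per the table.
SGPR_SETUP_ORDER = [
    ("private_segment_buffer", 4),
    ("dispatch_ptr", 2),
    ("queue_ptr", 2),
    ("kernarg_segment_ptr", 2),
    ("dispatch_id", 2),
    ("flat_scratch_init", 2),
    ("private_segment_size", 1),
    ("workgroup_id_x", 1),
    ("workgroup_id_y", 1),
    ("workgroup_id_z", 1),
    ("workgroup_info", 1),
    ("private_segment_wavefront_offset", 1),
]

def assign_sgprs(enabled):
    """Return {register name: first SGPR number} for the enabled set."""
    assignment, next_sgpr = {}, 0
    for name, count in SGPR_SETUP_ORDER:
        if name in enabled:
            assignment[name] = next_sgpr
            next_sgpr += count
    return assignment
```

For example, enabling only the private segment buffer, dispatch pointer, kernarg segment pointer, and work-group id X yields SGPR0-3, SGPR4-5, SGPR6-7 and SGPR8 respectively.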
4805 4806.. _amdgpu-amdhsa-kernel-prolog-m0: 4807 4808M0 4809++ 4810 4811GFX6-GFX8 4812 The M0 register must be initialized with a value of at least the total LDS size 4813 if the kernel may access LDS via DS or flat operations. Total LDS size is 4814 available in the dispatch packet. For M0, it is also possible to use the maximum 4815 possible value of LDS for a given target (0x7FFF for GFX6 and 0xFFFF for 4816 GFX7-GFX8). 4817GFX9-GFX11 4818 The M0 register is not used for range checking LDS accesses and so does not 4819 need to be initialized in the prolog. 4820 4821.. _amdgpu-amdhsa-kernel-prolog-stack-pointer: 4822 4823Stack Pointer 4824+++++++++++++ 4825 4826If the kernel has function calls it must set up the ABI stack pointer described 4827in :ref:`amdgpu-amdhsa-function-call-convention-non-kernel-functions` by setting 4828SGPR32 to the unswizzled scratch offset of the address past the last local 4829allocation. 4830 4831.. _amdgpu-amdhsa-kernel-prolog-frame-pointer: 4832 4833Frame Pointer 4834+++++++++++++ 4835 4836If the kernel needs a frame pointer for the reasons defined in 4837``SIFrameLowering`` then SGPR33 is used and is always set to ``0`` in the 4838kernel prolog. If a frame pointer is not required then all uses of the frame 4839pointer are replaced with immediate ``0`` offsets. 4840 4841.. _amdgpu-amdhsa-kernel-prolog-flat-scratch: 4842 4843Flat Scratch 4844++++++++++++ 4845 4846There are different methods used for initializing flat scratch: 4847 4848* If the *Target Properties* column of :ref:`amdgpu-processor-table` 4849 specifies *Does not support generic address space*: 4850 4851 Flat scratch is not supported and there is no flat scratch register pair.
4852 4853* If the *Target Properties* column of :ref:`amdgpu-processor-table` 4854 specifies *Offset flat scratch*: 4855 4856 If the kernel or any function it calls may use flat operations to access 4857 scratch memory, the prolog code must set up the FLAT_SCRATCH register pair 4858 (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI). Initialization uses Flat Scratch Init and 4859 Scratch Wavefront Offset SGPR registers (see 4860 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`): 4861 4862 1. The low word of Flat Scratch Init is the 32-bit byte offset from 4863 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory 4864 being managed by SPI for the queue executing the kernel dispatch. This is 4865 the same value used in the Scratch Segment Buffer V# base address. 4866 4867 CP obtains this from the runtime. (The Scratch Segment Buffer base address 4868 is ``SH_HIDDEN_PRIVATE_BASE_VIMID`` plus this offset.) 4869 4870 The prolog must add the value of Scratch Wavefront Offset to get the 4871 wavefront's byte scratch backing memory offset from 4872 ``SH_HIDDEN_PRIVATE_BASE_VIMID``. 4873 4874 The Scratch Wavefront Offset must also be used as an offset with Private 4875 segment address when using the Scratch Segment Buffer. 4876 4877 Since FLAT_SCRATCH_HI is in units of 256 bytes, the offset must be right 4878 shifted by 8 before moving into FLAT_SCRATCH_HI. 4879 4880 FLAT_SCRATCH_HI corresponds to SGPRn-4 on GFX7, and SGPRn-6 on GFX8 (where 4881 SGPRn is the highest numbered SGPR allocated to the wavefront). 4882 FLAT_SCRATCH_HI is multiplied by 256 (as it is in units of 256 bytes) and 4883 added to ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to calculate the per wavefront 4884 FLAT SCRATCH BASE in flat memory instructions that access the scratch 4885 aperture. 4886 2. The second word of Flat Scratch Init is the 32-bit byte size of a single 4887 work-item's scratch memory usage. 4888 4889 CP obtains this from the runtime, and it is always a multiple of DWORD.
CP 4890 checks that the value in the kernel dispatch packet Private Segment Byte 4891 Size is not larger than this size and requests the runtime to increase the queue's scratch 4892 size if necessary. 4893 4894 CP directly loads from the kernel dispatch packet Private Segment Byte Size 4895 field and rounds up to a multiple of DWORD. Having CP load it once avoids 4896 loading it at the beginning of every wavefront. 4897 4898 The kernel prolog code must move it to FLAT_SCRATCH_LO which is SGPRn-3 on 4899 GFX7 and SGPRn-5 on GFX8. FLAT_SCRATCH_LO is used as the FLAT SCRATCH SIZE 4900 in flat memory instructions. 4901 4902* If the *Target Properties* column of :ref:`amdgpu-processor-table` 4903 specifies *Absolute flat scratch*: 4904 4905 If the kernel or any function it calls may use flat operations to access 4906 scratch memory, the prolog code must set up the FLAT_SCRATCH register pair 4907 (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which are in SGPRn-4/SGPRn-3). Initialization 4908 uses Flat Scratch Init and Scratch Wavefront Offset SGPR registers (see 4909 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`): 4910 4911 The Flat Scratch Init is the 64-bit address of the base of scratch backing 4912 memory being managed by SPI for the queue executing the kernel dispatch. 4913 4914 CP obtains this from the runtime. 4915 4916 The kernel prolog must add the value of the wave's Scratch Wavefront Offset 4917 and move the result as a 64-bit value to the FLAT_SCRATCH SGPR register pair 4918 which is SGPRn-6 and SGPRn-5. It is used as the FLAT SCRATCH BASE in flat 4919 memory instructions. 4920 4921 The Scratch Wavefront Offset must also be used as an offset with Private 4922 segment address when using the Scratch Segment Buffer (see 4923 :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`).
4924 4925* If the *Target Properties* column of :ref:`amdgpu-processor-table` 4926 specifies *Architected flat scratch*: 4927 4928 If ENABLE_PRIVATE_SEGMENT is enabled in 4929 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table` then the FLAT_SCRATCH 4930 register pair will be initialized to the 64-bit address of the base of scratch 4931 backing memory being managed by SPI for the queue executing the kernel 4932 dispatch plus the value of the wave's Scratch Wavefront Offset for use as the 4933 flat scratch base in flat memory instructions. 4934 4935.. _amdgpu-amdhsa-kernel-prolog-private-segment-buffer: 4936 4937Private Segment Buffer 4938++++++++++++++++++++++ 4939 4940If the *Target Properties* column of :ref:`amdgpu-processor-table` specifies 4941*Architected flat scratch* then a Private Segment Buffer is not supported. 4942Instead the flat SCRATCH instructions are used. 4943 4944Otherwise, the Private Segment Buffer SGPR register is used to initialize 4 SGPRs 4945that are used as a V# to access scratch. CP uses the value provided by the 4946runtime. It is used, together with Scratch Wavefront Offset as an offset, to 4947access the private memory space using a segment address. See 4948:ref:`amdgpu-amdhsa-initial-kernel-execution-state`. 4949 4950The scratch V# is a four-aligned SGPR and always selected for the kernel as 4951follows: 4952 4953 - If it is known during instruction selection that there is stack usage, 4954 SGPR0-3 is reserved for use as the scratch V#. Stack usage is assumed if 4955 optimizations are disabled (``-O0``), if stack objects already exist (for 4956 locals, etc.), or if there are any function calls. 4957 4958 - Otherwise, four high numbered SGPRs beginning at a four-aligned SGPR index 4959 are reserved for the tentative scratch V#. These will be used if it is 4960 determined that spilling is needed. 4961 4962 - If no use is made of the tentative scratch V#, then it is unreserved, 4963 and the register count is determined ignoring it.
4964 - If use is made of the tentative scratch V#, then its register numbers 4965 are shifted to the first four-aligned SGPR index after the highest one 4966 allocated by the register allocator, and all uses are updated. The 4967 register count includes them in the shifted location. 4968 - In either case, if the processor has the SGPR allocation bug, the 4969 tentative allocation is not shifted or unreserved in order to ensure 4970 the register count is higher to work around the bug. 4971 4972 .. note:: 4973 4974 This approach of using a tentative scratch V# and shifting the register 4975 numbers if used avoids having to perform register allocation a second 4976 time if the tentative V# is eliminated. This is more efficient and 4977 avoids the problem that the second register allocation may perform 4978 spilling which will fail as there is no longer a scratch V#. 4979 4980When the kernel prolog code is being emitted it is known whether the scratch V# 4981described above is actually used. If it is, the prolog code must set it up by 4982copying the Private Segment Buffer to the scratch V# registers and then adding 4983the Private Segment Wavefront Offset to the queue base address in the V#. The 4984result is a V# with a base address pointing to the beginning of the wavefront 4985scratch backing memory. 4986 4987The Private Segment Buffer is always requested, but the Private Segment 4988Wavefront Offset is only requested if it is used (see 4989:ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 4990 4991.. _amdgpu-amdhsa-memory-model: 4992 4993Memory Model 4994~~~~~~~~~~~~ 4995 4996This section describes the mapping of the LLVM memory model onto AMDGPU machine 4997code (see :ref:`memmodel`). 4998 4999The AMDGPU backend supports the memory synchronization scopes specified in 5000:ref:`amdgpu-memory-scopes`. 5001 5002The code sequences used to implement the memory model specify the order of 5003instructions that a single thread must execute.
The ``s_waitcnt`` and cache 5004 management instructions such as ``buffer_wbinvl1_vol`` are defined with respect 5005 to other memory instructions executed by the same thread. This allows them to be 5006 moved earlier or later which can allow them to be combined with other instances 5007 of the same instruction, or hoisted/sunk out of loops to improve performance. 5008 Only the instructions related to the memory model are given; additional 5009 ``s_waitcnt`` instructions are required to ensure registers are defined before 5010 being used. These can sometimes be combined with the memory model ``s_waitcnt`` 5011 instructions as described above. 5012 5013The AMDGPU backend supports the following memory models: 5014 5015 HSA Memory Model [HSA]_ 5016 The HSA memory model uses a single happens-before relation for all address 5017 spaces (see :ref:`amdgpu-address-spaces`). 5018 OpenCL Memory Model [OpenCL]_ 5019 The OpenCL memory model has separate happens-before relations for the 5020 global and local address spaces. Only a fence specifying both global and 5021 local address space, and seq_cst instructions join the relationships. Since 5022 the LLVM ``fence`` instruction does not allow an address space to be 5023 specified the OpenCL fence has to conservatively assume both local and 5024 global address space was specified. However, optimizations can often be 5025 done to eliminate the additional ``s_waitcnt`` instructions when there are 5026 no intervening memory instructions which access the corresponding address 5027 space. The code sequences in the table indicate what can be omitted for the 5028 OpenCL memory model. The target triple environment is used to determine if the 5029 source language is OpenCL (see :ref:`amdgpu-opencl`). 5030 5031``ds/flat_load/store/atomic`` instructions to local memory are termed LDS 5032operations. 5033 5034``buffer/global/flat_load/store/atomic`` instructions to global memory are 5035termed vector memory operations.
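The two instruction classes just defined can be sketched as follows. Note that a flat instruction's class depends on the memory it actually accesses, not on its mnemonic alone; the helper below is an illustration with hypothetical names, not part of the backend:

```python
# Sketch of the two instruction classes defined above (hypothetical helper,
# not LLVM code). A flat instruction is classified by the memory it actually
# accesses, since flat addresses can map to either LDS (local) or global
# memory.

def classify(mnemonic, address_space):
    """Return 'LDS operation' or 'vector memory operation'.

    address_space is 'local' or 'global': the memory actually accessed.
    """
    kind = mnemonic.split("_", 1)[0]  # e.g. 'ds', 'flat', 'buffer', 'global'
    if kind == "ds":
        return "LDS operation"
    if kind == "flat":
        return ("LDS operation" if address_space == "local"
                else "vector memory operation")
    if kind in ("buffer", "global"):
        return "vector memory operation"
    raise ValueError("not a memory instruction covered here: " + mnemonic)
```

For example, ``flat_load_dword`` is an LDS operation when it addresses local memory and a vector memory operation when it addresses global memory.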
5036 5037Private address space uses ``buffer_load/store`` using the scratch V# 5038(GFX6-GFX8), or ``scratch_load/store`` (GFX9-GFX11). Since only a single thread 5039is accessing the memory, atomic memory orderings are not meaningful, and all 5040accesses are treated as non-atomic. 5041 5042Constant address space uses ``buffer/global_load`` instructions (or equivalent 5043scalar memory instructions). Since the constant address space contents do not 5044change during the execution of a kernel dispatch, it is not legal to perform 5045stores; atomic memory orderings are not meaningful, and all accesses are 5046treated as non-atomic. 5047 5048A memory synchronization scope wider than work-group is not meaningful for the 5049group (LDS) address space and is treated as work-group. 5050 5051The memory model does not support the region address space which is treated as 5052non-atomic. 5053 5054Acquire memory ordering is not meaningful on store atomic instructions and is 5055treated as non-atomic. 5056 5057Release memory ordering is not meaningful on load atomic instructions and is 5058treated as non-atomic. 5059 5060Acquire-release memory ordering is not meaningful on load or store atomic 5061instructions and is treated as acquire and release respectively. 5062 5063The memory order also adds the single thread optimization constraints defined in 5064table 5065:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table`. 5066 5067 ..
table:: AMDHSA Memory Model Single Thread Optimization Constraints 5068 :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table 5069 5070 ============ ============================================================== 5071 LLVM Memory Optimization Constraints 5072 Ordering 5073 ============ ============================================================== 5074 unordered *none* 5075 monotonic *none* 5076 acquire - If a load atomic/atomicrmw then no following load/load 5077 atomic/store/store atomic/atomicrmw/fence instruction can be 5078 moved before the acquire. 5079 - If a fence then same as load atomic, plus no preceding 5080 associated fence-paired-atomic can be moved after the fence. 5081 release - If a store atomic/atomicrmw then no preceding load/load 5082 atomic/store/store atomic/atomicrmw/fence instruction can be 5083 moved after the release. 5084 - If a fence then same as store atomic, plus no following 5085 associated fence-paired-atomic can be moved before the 5086 fence. 5087 acq_rel Same constraints as both acquire and release. 5088 seq_cst - If a load atomic then same constraints as acquire, plus no 5089 preceding sequentially consistent load atomic/store 5090 atomic/atomicrmw/fence instruction can be moved after the 5091 seq_cst. 5092 - If a store atomic then the same constraints as release, plus 5093 no following sequentially consistent load atomic/store 5094 atomic/atomicrmw/fence instruction can be moved before the 5095 seq_cst. 5096 - If an atomicrmw/fence then same constraints as acq_rel. 5097 ============ ============================================================== 5098 5099The code sequences used to implement the memory model are defined in the 5100following sections: 5101 5102* :ref:`amdgpu-amdhsa-memory-model-gfx6-gfx9` 5103* :ref:`amdgpu-amdhsa-memory-model-gfx90a` 5104* :ref:`amdgpu-amdhsa-memory-model-gfx940` 5105* :ref:`amdgpu-amdhsa-memory-model-gfx10-gfx11` 5106 5107.. 
_amdgpu-amdhsa-memory-model-gfx6-gfx9: 5108 5109Memory Model GFX6-GFX9 5110++++++++++++++++++++++ 5111 5112For GFX6-GFX9: 5113 5114* Each agent has multiple shader arrays (SA). 5115* Each SA has multiple compute units (CU). 5116* Each CU has multiple SIMDs that execute wavefronts. 5117* The wavefronts for a single work-group are executed in the same CU but may be 5118 executed by different SIMDs. 5119* Each CU has a single LDS memory shared by the wavefronts of the work-groups 5120 executing on it. 5121* All LDS operations of a CU are performed as wavefront wide operations in a 5122 global order and involve no caching. Completion is reported to a wavefront in 5123 execution order. 5124* The LDS memory has multiple request queues shared by the SIMDs of a 5125 CU. Therefore, the LDS operations performed by different wavefronts of a 5126 work-group can be reordered relative to each other, which can result in 5127 reordering the visibility of vector memory operations with respect to LDS 5128 operations of other wavefronts in the same work-group. A ``s_waitcnt 5129 lgkmcnt(0)`` is required to ensure synchronization between LDS operations and 5130 vector memory operations between wavefronts of a work-group, but not between 5131 operations performed by the same wavefront. 5132* The vector memory operations are performed as wavefront wide operations and 5133 completion is reported to a wavefront in execution order. The exception is 5134 that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of 5135 vector memory order if they access LDS memory, and out of LDS operation order 5136 if they access global memory. 5137* The vector memory operations access a single vector L1 cache shared by all 5138 SIMDs of a CU. Therefore, no special action is required for coherence between the 5139 lanes of a single wavefront, or for coherence between wavefronts in the same 5140 work-group.
A ``buffer_wbinvl1_vol`` is required for coherence between 5141 wavefronts executing in different work-groups as they may be executing on 5142 different CUs. 5143* The scalar memory operations access a scalar L1 cache shared by all wavefronts 5144 on a group of CUs. The scalar and vector L1 caches are not coherent. However, 5145 scalar operations are used in a restricted way so do not impact the memory 5146 model. See :ref:`amdgpu-amdhsa-memory-spaces`. 5147* The vector and scalar memory operations use an L2 cache shared by all CUs on 5148 the same agent. 5149* The L2 cache has independent channels to service disjoint ranges of virtual 5150 addresses. 5151* Each CU has a separate request queue per channel. Therefore, the vector and 5152 scalar memory operations performed by wavefronts executing in different 5153 work-groups (which may be executing on different CUs) of an agent can be 5154 reordered relative to each other. A ``s_waitcnt vmcnt(0)`` is required to 5155 ensure synchronization between vector memory operations of different CUs. It 5156 ensures a previous vector memory operation has completed before executing a 5157 subsequent vector memory or LDS operation and so can be used to meet the 5158 requirements of acquire and release. 5159* The L2 cache can be kept coherent with other agents on some targets, or ranges 5160 of virtual addresses can be set up to bypass it to ensure system coherence. 5161 5162Scalar memory operations are only used to access memory that is proven to not 5163change during the execution of the kernel dispatch. This includes constant 5164address space and global address space for program scope ``const`` variables. 5165Therefore, the kernel machine code does not have to maintain the scalar cache to 5166ensure it is coherent with the vector caches. The scalar and vector caches are 5167invalidated between kernel dispatches by CP since constant address space data 5168may change between kernel dispatch executions. 
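The cache structure above is why agent-scope release/acquire needs more than instruction ordering. As a sketch in standard C++ (not AMDGPU machine code; ``run_message_passing`` is a hypothetical name used here for illustration), the message-passing idiom below is the kind of program the code sequences in the following sections must make work: on GFX6-GFX9 the release store is preceded by ``s_waitcnt vmcnt(0)`` so prior writes have reached the shared L2 cache, and the acquire load is followed by ``buffer_wbinvl1_vol`` so the reading CU's vector L1 drops stale lines.

```cpp
// Sketch (standard C++, not AMDGPU machine code): the message-passing
// idiom that agent-scope release/acquire must support. On GFX6-GFX9 the
// release store is preceded by "s_waitcnt vmcnt(0)" (prior writes reach
// the shared L2) and the acquire load is followed by "buffer_wbinvl1_vol"
// (the reader's per-CU vector L1 is invalidated).
#include <atomic>
#include <thread>

namespace {
int data = 0;                  // plain (non-atomic) global payload
std::atomic<bool> flag{false}; // synchronization flag
} // namespace

// Runs one producer/consumer round and returns the value the consumer
// observed after its acquire load of the flag.
int run_message_passing() {
  data = 0;
  flag.store(false, std::memory_order_relaxed);

  std::thread producer([] {
    data = 42;                                   // plain global store
    flag.store(true, std::memory_order_release); // "store atomic release"
  });

  int seen = 0;
  std::thread consumer([&seen] {
    while (!flag.load(std::memory_order_acquire)) // "load atomic acquire"
      ;                                           // spin until released
    seen = data; // must not observe a stale cached value
  });

  producer.join();
  consumer.join();
  return seen;
}
```

If the wait or the invalidate were omitted from the generated sequence, the consumer could observe ``flag`` set while still reading a stale value of ``data`` from its own L1, which is exactly what the acquire/release table entries rule out.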
See 5169:ref:`amdgpu-amdhsa-memory-spaces`. 5170 5171The one exception is if scalar writes are used to spill SGPR registers. In this 5172case the AMDGPU backend ensures the memory location used to spill is never 5173accessed by vector memory operations at the same time. If scalar writes are used 5174then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function 5175return since the locations may be used for vector memory instructions by a 5176future wavefront that uses the same scratch area, or a function call that 5177creates a frame at the same address, respectively. There is no need for a 5178``s_dcache_inv`` as all scalar writes are write-before-read in the same thread. 5179 5180For kernarg backing memory: 5181 5182* CP invalidates the L1 cache at the start of each kernel dispatch. 5183* On dGPU the kernarg backing memory is allocated in host memory accessed as 5184 MTYPE UC (uncached) to avoid needing to invalidate the L2 cache. This also 5185 causes it to be treated as non-volatile and so is not invalidated by 5186 ``*_vol``. 5187* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) 5188 and so the L2 cache will be coherent with the CPU and other agents. 5189 5190Scratch backing memory (which is used for the private address space) is accessed 5191with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is 5192only accessed by a single thread, and is always write-before-read, there is 5193never a need to invalidate these entries from the L1 cache. Hence all cache 5194invalidates are done as ``*_vol`` to only invalidate the volatile cache lines. 5195 5196The code sequences used to implement the memory model for GFX6-GFX9 are defined 5197in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`. 5198 5199 ..
table:: AMDHSA Memory Model Code Sequences GFX6-GFX9 5200 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table 5201 5202 ============ ============ ============== ========== ================================ 5203 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code 5204 Ordering Sync Scope Address GFX6-GFX9 5205 Space 5206 ============ ============ ============== ========== ================================ 5207 **Non-Atomic** 5208 ------------------------------------------------------------------------------------ 5209 load *none* *none* - global - !volatile & !nontemporal 5210 - generic 5211 - private 1. buffer/global/flat_load 5212 - constant 5213 - !volatile & nontemporal 5214 5215 1. buffer/global/flat_load 5216 glc=1 slc=1 5217 5218 - volatile 5219 5220 1. buffer/global/flat_load 5221 glc=1 5222 2. s_waitcnt vmcnt(0) 5223 5224 - Must happen before 5225 any following volatile 5226 global/generic 5227 load/store. 5228 - Ensures that 5229 volatile 5230 operations to 5231 different 5232 addresses will not 5233 be reordered by 5234 hardware. 5235 5236 load *none* *none* - local 1. ds_load 5237 store *none* *none* - global - !volatile & !nontemporal 5238 - generic 5239 - private 1. buffer/global/flat_store 5240 - constant 5241 - !volatile & nontemporal 5242 5243 1. buffer/global/flat_store 5244 glc=1 slc=1 5245 5246 - volatile 5247 5248 1. buffer/global/flat_store 5249 2. s_waitcnt vmcnt(0) 5250 5251 - Must happen before 5252 any following volatile 5253 global/generic 5254 load/store. 5255 - Ensures that 5256 volatile 5257 operations to 5258 different 5259 addresses will not 5260 be reordered by 5261 hardware. 5262 5263 store *none* *none* - local 1. ds_store 5264 **Unordered Atomic** 5265 ------------------------------------------------------------------------------------ 5266 load atomic unordered *any* *any* *Same as non-atomic*. 5267 store atomic unordered *any* *any* *Same as non-atomic*. 
5268 atomicrmw unordered *any* *any* *Same as monotonic atomic*. 5269 **Monotonic Atomic** 5270 ------------------------------------------------------------------------------------ 5271 load atomic monotonic - singlethread - global 1. buffer/global/ds/flat_load 5272 - wavefront - local 5273 - workgroup - generic 5274 load atomic monotonic - agent - global 1. buffer/global/flat_load 5275 - system - generic glc=1 5276 store atomic monotonic - singlethread - global 1. buffer/global/flat_store 5277 - wavefront - generic 5278 - workgroup 5279 - agent 5280 - system 5281 store atomic monotonic - singlethread - local 1. ds_store 5282 - wavefront 5283 - workgroup 5284 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic 5285 - wavefront - generic 5286 - workgroup 5287 - agent 5288 - system 5289 atomicrmw monotonic - singlethread - local 1. ds_atomic 5290 - wavefront 5291 - workgroup 5292 **Acquire Atomic** 5293 ------------------------------------------------------------------------------------ 5294 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load 5295 - wavefront - local 5296 - generic 5297 load atomic acquire - workgroup - global 1. buffer/global_load 5298 load atomic acquire - workgroup - local 1. ds/flat_load 5299 - generic 2. s_waitcnt lgkmcnt(0) 5300 5301 - If OpenCL, omit. 5302 - Must happen before 5303 any following 5304 global/generic 5305 load/load 5306 atomic/store/store 5307 atomic/atomicrmw. 5308 - Ensures any 5309 following global 5310 data read is no 5311 older than a local load 5312 atomic value being 5313 acquired. 5314 5315 load atomic acquire - agent - global 1. buffer/global_load 5316 - system glc=1 5317 2. s_waitcnt vmcnt(0) 5318 5319 - Must happen before 5320 following 5321 buffer_wbinvl1_vol. 5322 - Ensures the load 5323 has completed 5324 before invalidating 5325 the cache. 5326 5327 3. 
buffer_wbinvl1_vol 5328 5329 - Must happen before 5330 any following 5331 global/generic 5332 load/load 5333 atomic/atomicrmw. 5334 - Ensures that 5335 following 5336 loads will not see 5337 stale global data. 5338 5339 load atomic acquire - agent - generic 1. flat_load glc=1 5340 - system 2. s_waitcnt vmcnt(0) & 5341 lgkmcnt(0) 5342 5343 - If OpenCL omit 5344 lgkmcnt(0). 5345 - Must happen before 5346 following 5347 buffer_wbinvl1_vol. 5348 - Ensures the flat_load 5349 has completed 5350 before invalidating 5351 the cache. 5352 5353 3. buffer_wbinvl1_vol 5354 5355 - Must happen before 5356 any following 5357 global/generic 5358 load/load 5359 atomic/atomicrmw. 5360 - Ensures that 5361 following loads 5362 will not see stale 5363 global data. 5364 5365 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic 5366 - wavefront - local 5367 - generic 5368 atomicrmw acquire - workgroup - global 1. buffer/global_atomic 5369 atomicrmw acquire - workgroup - local 1. ds/flat_atomic 5370 - generic 2. s_waitcnt lgkmcnt(0) 5371 5372 - If OpenCL, omit. 5373 - Must happen before 5374 any following 5375 global/generic 5376 load/load 5377 atomic/store/store 5378 atomic/atomicrmw. 5379 - Ensures any 5380 following global 5381 data read is no 5382 older than a local 5383 atomicrmw value 5384 being acquired. 5385 5386 atomicrmw acquire - agent - global 1. buffer/global_atomic 5387 - system 2. s_waitcnt vmcnt(0) 5388 5389 - Must happen before 5390 following 5391 buffer_wbinvl1_vol. 5392 - Ensures the 5393 atomicrmw has 5394 completed before 5395 invalidating the 5396 cache. 5397 5398 3. buffer_wbinvl1_vol 5399 5400 - Must happen before 5401 any following 5402 global/generic 5403 load/load 5404 atomic/atomicrmw. 5405 - Ensures that 5406 following loads 5407 will not see stale 5408 global data. 5409 5410 atomicrmw acquire - agent - generic 1. flat_atomic 5411 - system 2. s_waitcnt vmcnt(0) & 5412 lgkmcnt(0) 5413 5414 - If OpenCL, omit 5415 lgkmcnt(0). 
5416 - Must happen before 5417 following 5418 buffer_wbinvl1_vol. 5419 - Ensures the 5420 atomicrmw has 5421 completed before 5422 invalidating the 5423 cache. 5424 5425 3. buffer_wbinvl1_vol 5426 5427 - Must happen before 5428 any following 5429 global/generic 5430 load/load 5431 atomic/atomicrmw. 5432 - Ensures that 5433 following loads 5434 will not see stale 5435 global data. 5436 5437 fence acquire - singlethread *none* *none* 5438 - wavefront 5439 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0) 5440 5441 - If OpenCL and 5442 address space is 5443 not generic, omit. 5444 - However, since LLVM 5445 currently has no 5446 address space on 5447 the fence need to 5448 conservatively 5449 always generate. If 5450 fence had an 5451 address space then 5452 set to address 5453 space of OpenCL 5454 fence flag, or to 5455 generic if both 5456 local and global 5457 flags are 5458 specified. 5459 - Must happen after 5460 any preceding 5461 local/generic load 5462 atomic/atomicrmw 5463 with an equal or 5464 wider sync scope 5465 and memory ordering 5466 stronger than 5467 unordered (this is 5468 termed the 5469 fence-paired-atomic). 5470 - Must happen before 5471 any following 5472 global/generic 5473 load/load 5474 atomic/store/store 5475 atomic/atomicrmw. 5476 - Ensures any 5477 following global 5478 data read is no 5479 older than the 5480 value read by the 5481 fence-paired-atomic. 5482 5483 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) & 5484 - system vmcnt(0) 5485 5486 - If OpenCL and 5487 address space is 5488 not generic, omit 5489 lgkmcnt(0). 5490 - However, since LLVM 5491 currently has no 5492 address space on 5493 the fence need to 5494 conservatively 5495 always generate 5496 (see comment for 5497 previous fence). 5498 - Could be split into 5499 separate s_waitcnt 5500 vmcnt(0) and 5501 s_waitcnt 5502 lgkmcnt(0) to allow 5503 them to be 5504 independently moved 5505 according to the 5506 following rules. 
5507 - s_waitcnt vmcnt(0) 5508 must happen after 5509 any preceding 5510 global/generic load 5511 atomic/atomicrmw 5512 with an equal or 5513 wider sync scope 5514 and memory ordering 5515 stronger than 5516 unordered (this is 5517 termed the 5518 fence-paired-atomic). 5519 - s_waitcnt lgkmcnt(0) 5520 must happen after 5521 any preceding 5522 local/generic load 5523 atomic/atomicrmw 5524 with an equal or 5525 wider sync scope 5526 and memory ordering 5527 stronger than 5528 unordered (this is 5529 termed the 5530 fence-paired-atomic). 5531 - Must happen before 5532 the following 5533 buffer_wbinvl1_vol. 5534 - Ensures that the 5535 fence-paired atomic 5536 has completed 5537 before invalidating 5538 the 5539 cache. Therefore 5540 any following 5541 locations read must 5542 be no older than 5543 the value read by 5544 the 5545 fence-paired-atomic. 5546 5547 2. buffer_wbinvl1_vol 5548 5549 - Must happen before any 5550 following global/generic 5551 load/load 5552 atomic/store/store 5553 atomic/atomicrmw. 5554 - Ensures that 5555 following loads 5556 will not see stale 5557 global data. 5558 5559 **Release Atomic** 5560 ------------------------------------------------------------------------------------ 5561 store atomic release - singlethread - global 1. buffer/global/ds/flat_store 5562 - wavefront - local 5563 - generic 5564 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0) 5565 - generic 5566 - If OpenCL, omit. 5567 - Must happen after 5568 any preceding 5569 local/generic 5570 load/store/load 5571 atomic/store 5572 atomic/atomicrmw. 5573 - Must happen before 5574 the following 5575 store. 5576 - Ensures that all 5577 memory operations 5578 to local have 5579 completed before 5580 performing the 5581 store that is being 5582 released. 5583 5584 2. buffer/global/flat_store 5585 store atomic release - workgroup - local 1. ds_store 5586 store atomic release - agent - global 1. 
s_waitcnt lgkmcnt(0) & 5587 - system - generic vmcnt(0) 5588 5589 - If OpenCL and 5590 address space is 5591 not generic, omit 5592 lgkmcnt(0). 5593 - Could be split into 5594 separate s_waitcnt 5595 vmcnt(0) and 5596 s_waitcnt 5597 lgkmcnt(0) to allow 5598 them to be 5599 independently moved 5600 according to the 5601 following rules. 5602 - s_waitcnt vmcnt(0) 5603 must happen after 5604 any preceding 5605 global/generic 5606 load/store/load 5607 atomic/store 5608 atomic/atomicrmw. 5609 - s_waitcnt lgkmcnt(0) 5610 must happen after 5611 any preceding 5612 local/generic 5613 load/store/load 5614 atomic/store 5615 atomic/atomicrmw. 5616 - Must happen before 5617 the following 5618 store. 5619 - Ensures that all 5620 memory operations 5621 to memory have 5622 completed before 5623 performing the 5624 store that is being 5625 released. 5626 5627 2. buffer/global/flat_store 5628 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic 5629 - wavefront - local 5630 - generic 5631 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0) 5632 - generic 5633 - If OpenCL, omit. 5634 - Must happen after 5635 any preceding 5636 local/generic 5637 load/store/load 5638 atomic/store 5639 atomic/atomicrmw. 5640 - Must happen before 5641 the following 5642 atomicrmw. 5643 - Ensures that all 5644 memory operations 5645 to local have 5646 completed before 5647 performing the 5648 atomicrmw that is 5649 being released. 5650 5651 2. buffer/global/flat_atomic 5652 atomicrmw release - workgroup - local 1. ds_atomic 5653 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) & 5654 - system - generic vmcnt(0) 5655 5656 - If OpenCL, omit 5657 lgkmcnt(0). 5658 - Could be split into 5659 separate s_waitcnt 5660 vmcnt(0) and 5661 s_waitcnt 5662 lgkmcnt(0) to allow 5663 them to be 5664 independently moved 5665 according to the 5666 following rules. 
5667 - s_waitcnt vmcnt(0) 5668 must happen after 5669 any preceding 5670 global/generic 5671 load/store/load 5672 atomic/store 5673 atomic/atomicrmw. 5674 - s_waitcnt lgkmcnt(0) 5675 must happen after 5676 any preceding 5677 local/generic 5678 load/store/load 5679 atomic/store 5680 atomic/atomicrmw. 5681 - Must happen before 5682 the following 5683 atomicrmw. 5684 - Ensures that all 5685 memory operations 5686 to global and local 5687 have completed 5688 before performing 5689 the atomicrmw that 5690 is being released. 5691 5692 2. buffer/global/flat_atomic 5693 fence release - singlethread *none* *none* 5694 - wavefront 5695 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0) 5696 5697 - If OpenCL and 5698 address space is 5699 not generic, omit. 5700 - However, since LLVM 5701 currently has no 5702 address space on 5703 the fence need to 5704 conservatively 5705 always generate. If 5706 fence had an 5707 address space then 5708 set to address 5709 space of OpenCL 5710 fence flag, or to 5711 generic if both 5712 local and global 5713 flags are 5714 specified. 5715 - Must happen after 5716 any preceding 5717 local/generic 5718 load/load 5719 atomic/store/store 5720 atomic/atomicrmw. 5721 - Must happen before 5722 any following store 5723 atomic/atomicrmw 5724 with an equal or 5725 wider sync scope 5726 and memory ordering 5727 stronger than 5728 unordered (this is 5729 termed the 5730 fence-paired-atomic). 5731 - Ensures that all 5732 memory operations 5733 to local have 5734 completed before 5735 performing the 5736 following 5737 fence-paired-atomic. 5738 5739 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) & 5740 - system vmcnt(0) 5741 5742 - If OpenCL and 5743 address space is 5744 not generic, omit 5745 lgkmcnt(0). 5746 - If OpenCL and 5747 address space is 5748 local, omit 5749 vmcnt(0). 5750 - However, since LLVM 5751 currently has no 5752 address space on 5753 the fence need to 5754 conservatively 5755 always generate. 
If 5756 fence had an 5757 address space then 5758 set to address 5759 space of OpenCL 5760 fence flag, or to 5761 generic if both 5762 local and global 5763 flags are 5764 specified. 5765 - Could be split into 5766 separate s_waitcnt 5767 vmcnt(0) and 5768 s_waitcnt 5769 lgkmcnt(0) to allow 5770 them to be 5771 independently moved 5772 according to the 5773 following rules. 5774 - s_waitcnt vmcnt(0) 5775 must happen after 5776 any preceding 5777 global/generic 5778 load/store/load 5779 atomic/store 5780 atomic/atomicrmw. 5781 - s_waitcnt lgkmcnt(0) 5782 must happen after 5783 any preceding 5784 local/generic 5785 load/store/load 5786 atomic/store 5787 atomic/atomicrmw. 5788 - Must happen before 5789 any following store 5790 atomic/atomicrmw 5791 with an equal or 5792 wider sync scope 5793 and memory ordering 5794 stronger than 5795 unordered (this is 5796 termed the 5797 fence-paired-atomic). 5798 - Ensures that all 5799 memory operations 5800 have 5801 completed before 5802 performing the 5803 following 5804 fence-paired-atomic. 5805 5806 **Acquire-Release Atomic** 5807 ------------------------------------------------------------------------------------ 5808 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic 5809 - wavefront - local 5810 - generic 5811 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0) 5812 5813 - If OpenCL, omit. 5814 - Must happen after 5815 any preceding 5816 local/generic 5817 load/store/load 5818 atomic/store 5819 atomic/atomicrmw. 5820 - Must happen before 5821 the following 5822 atomicrmw. 5823 - Ensures that all 5824 memory operations 5825 to local have 5826 completed before 5827 performing the 5828 atomicrmw that is 5829 being released. 5830 5831 2. buffer/global_atomic 5832 5833 atomicrmw acq_rel - workgroup - local 1. ds_atomic 5834 2. s_waitcnt lgkmcnt(0) 5835 5836 - If OpenCL, omit. 
5837 - Must happen before 5838 any following 5839 global/generic 5840 load/load 5841 atomic/store/store 5842 atomic/atomicrmw. 5843 - Ensures any 5844 following global 5845 data read is no 5846 older than the local load 5847 atomic value being 5848 acquired. 5849 5850 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0) 5851 5852 - If OpenCL, omit. 5853 - Must happen after 5854 any preceding 5855 local/generic 5856 load/store/load 5857 atomic/store 5858 atomic/atomicrmw. 5859 - Must happen before 5860 the following 5861 atomicrmw. 5862 - Ensures that all 5863 memory operations 5864 to local have 5865 completed before 5866 performing the 5867 atomicrmw that is 5868 being released. 5869 5870 2. flat_atomic 5871 3. s_waitcnt lgkmcnt(0) 5872 5873 - If OpenCL, omit. 5874 - Must happen before 5875 any following 5876 global/generic 5877 load/load 5878 atomic/store/store 5879 atomic/atomicrmw. 5880 - Ensures any 5881 following global 5882 data read is no 5883 older than a local load 5884 atomic value being 5885 acquired. 5886 5887 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) & 5888 - system vmcnt(0) 5889 5890 - If OpenCL, omit 5891 lgkmcnt(0). 5892 - Could be split into 5893 separate s_waitcnt 5894 vmcnt(0) and 5895 s_waitcnt 5896 lgkmcnt(0) to allow 5897 them to be 5898 independently moved 5899 according to the 5900 following rules. 5901 - s_waitcnt vmcnt(0) 5902 must happen after 5903 any preceding 5904 global/generic 5905 load/store/load 5906 atomic/store 5907 atomic/atomicrmw. 5908 - s_waitcnt lgkmcnt(0) 5909 must happen after 5910 any preceding 5911 local/generic 5912 load/store/load 5913 atomic/store 5914 atomic/atomicrmw. 5915 - Must happen before 5916 the following 5917 atomicrmw. 5918 - Ensures that all 5919 memory operations 5920 to global have 5921 completed before 5922 performing the 5923 atomicrmw that is 5924 being released. 5925 5926 2. buffer/global_atomic 5927 3. 
s_waitcnt vmcnt(0) 5928 5929 - Must happen before 5930 following 5931 buffer_wbinvl1_vol. 5932 - Ensures the 5933 atomicrmw has 5934 completed before 5935 invalidating the 5936 cache. 5937 5938 4. buffer_wbinvl1_vol 5939 5940 - Must happen before 5941 any following 5942 global/generic 5943 load/load 5944 atomic/atomicrmw. 5945 - Ensures that 5946 following loads 5947 will not see stale 5948 global data. 5949 5950 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) & 5951 - system vmcnt(0) 5952 5953 - If OpenCL, omit 5954 lgkmcnt(0). 5955 - Could be split into 5956 separate s_waitcnt 5957 vmcnt(0) and 5958 s_waitcnt 5959 lgkmcnt(0) to allow 5960 them to be 5961 independently moved 5962 according to the 5963 following rules. 5964 - s_waitcnt vmcnt(0) 5965 must happen after 5966 any preceding 5967 global/generic 5968 load/store/load 5969 atomic/store 5970 atomic/atomicrmw. 5971 - s_waitcnt lgkmcnt(0) 5972 must happen after 5973 any preceding 5974 local/generic 5975 load/store/load 5976 atomic/store 5977 atomic/atomicrmw. 5978 - Must happen before 5979 the following 5980 atomicrmw. 5981 - Ensures that all 5982 memory operations 5983 to global have 5984 completed before 5985 performing the 5986 atomicrmw that is 5987 being released. 5988 5989 2. flat_atomic 5990 3. s_waitcnt vmcnt(0) & 5991 lgkmcnt(0) 5992 5993 - If OpenCL, omit 5994 lgkmcnt(0). 5995 - Must happen before 5996 following 5997 buffer_wbinvl1_vol. 5998 - Ensures the 5999 atomicrmw has 6000 completed before 6001 invalidating the 6002 cache. 6003 6004 4. buffer_wbinvl1_vol 6005 6006 - Must happen before 6007 any following 6008 global/generic 6009 load/load 6010 atomic/atomicrmw. 6011 - Ensures that 6012 following loads 6013 will not see stale 6014 global data. 6015 6016 fence acq_rel - singlethread *none* *none* 6017 - wavefront 6018 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0) 6019 6020 - If OpenCL and 6021 address space is 6022 not generic, omit. 
6023 - However, 6024 since LLVM 6025 currently has no 6026 address space on 6027 the fence need to 6028 conservatively 6029 always generate 6030 (see comment for 6031 previous fence). 6032 - Must happen after 6033 any preceding 6034 local/generic 6035 load/load 6036 atomic/store/store 6037 atomic/atomicrmw. 6038 - Must happen before 6039 any following 6040 global/generic 6041 load/load 6042 atomic/store/store 6043 atomic/atomicrmw. 6044 - Ensures that all 6045 memory operations 6046 to local have 6047 completed before 6048 performing any 6049 following global 6050 memory operations. 6051 - Ensures that the 6052 preceding 6053 local/generic load 6054 atomic/atomicrmw 6055 with an equal or 6056 wider sync scope 6057 and memory ordering 6058 stronger than 6059 unordered (this is 6060 termed the 6061 acquire-fence-paired-atomic) 6062 has completed 6063 before following 6064 global memory 6065 operations. This 6066 satisfies the 6067 requirements of 6068 acquire. 6069 - Ensures that all 6070 previous memory 6071 operations have 6072 completed before a 6073 following 6074 local/generic store 6075 atomic/atomicrmw 6076 with an equal or 6077 wider sync scope 6078 and memory ordering 6079 stronger than 6080 unordered (this is 6081 termed the 6082 release-fence-paired-atomic). 6083 This satisfies the 6084 requirements of 6085 release. 6086 6087 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) & 6088 - system vmcnt(0) 6089 6090 - If OpenCL and 6091 address space is 6092 not generic, omit 6093 lgkmcnt(0). 6094 - However, since LLVM 6095 currently has no 6096 address space on 6097 the fence need to 6098 conservatively 6099 always generate 6100 (see comment for 6101 previous fence). 6102 - Could be split into 6103 separate s_waitcnt 6104 vmcnt(0) and 6105 s_waitcnt 6106 lgkmcnt(0) to allow 6107 them to be 6108 independently moved 6109 according to the 6110 following rules. 
6111 - s_waitcnt vmcnt(0) 6112 must happen after 6113 any preceding 6114 global/generic 6115 load/store/load 6116 atomic/store 6117 atomic/atomicrmw. 6118 - s_waitcnt lgkmcnt(0) 6119 must happen after 6120 any preceding 6121 local/generic 6122 load/store/load 6123 atomic/store 6124 atomic/atomicrmw. 6125 - Must happen before 6126 the following 6127 buffer_wbinvl1_vol. 6128 - Ensures that the 6129 preceding 6130 global/local/generic 6131 load 6132 atomic/atomicrmw 6133 with an equal or 6134 wider sync scope 6135 and memory ordering 6136 stronger than 6137 unordered (this is 6138 termed the 6139 acquire-fence-paired-atomic) 6140 has completed 6141 before invalidating 6142 the cache. This 6143 satisfies the 6144 requirements of 6145 acquire. 6146 - Ensures that all 6147 previous memory 6148 operations have 6149 completed before a 6150 following 6151 global/local/generic 6152 store 6153 atomic/atomicrmw 6154 with an equal or 6155 wider sync scope 6156 and memory ordering 6157 stronger than 6158 unordered (this is 6159 termed the 6160 release-fence-paired-atomic). 6161 This satisfies the 6162 requirements of 6163 release. 6164 6165 2. buffer_wbinvl1_vol 6166 6167 - Must happen before 6168 any following 6169 global/generic 6170 load/load 6171 atomic/store/store 6172 atomic/atomicrmw. 6173 - Ensures that 6174 following loads 6175 will not see stale 6176 global data. This 6177 satisfies the 6178 requirements of 6179 acquire. 6180 6181 **Sequential Consistent Atomic** 6182 ------------------------------------------------------------------------------------ 6183 load atomic seq_cst - singlethread - global *Same as corresponding 6184 - wavefront - local load atomic acquire, 6185 - generic except must generate 6186 all instructions even 6187 for OpenCL.* 6188 load atomic seq_cst - workgroup - global 1. 
                                                         s_waitcnt lgkmcnt(0)
                                             - generic
                                                            - Must happen after
                                                              preceding local/generic
                                                              load atomic/store
                                                              atomic/atomicrmw with
                                                              memory ordering of
                                                              seq_cst and with equal
                                                              or wider sync scope.
                                                              (Note that seq_cst
                                                              fences have their own
                                                              s_waitcnt lgkmcnt(0)
                                                              and so do not need to
                                                              be considered.)
                                                            - Ensures any preceding
                                                              sequential consistent
                                                              local memory
                                                              instructions have
                                                              completed before
                                                              executing this
                                                              sequentially consistent
                                                              instruction. This
                                                              prevents reordering a
                                                              seq_cst store followed
                                                              by a seq_cst load.
                                                              (Note that seq_cst is
                                                              stronger than
                                                              acquire/release as the
                                                              reordering of load
                                                              acquire followed by a
                                                              store release is
                                                              prevented by the
                                                              s_waitcnt of the
                                                              release, but there is
                                                              nothing preventing a
                                                              store release followed
                                                              by load acquire from
                                                              completing out of
                                                              order. The s_waitcnt
                                                              could be placed after
                                                              seq_store or before the
                                                              seq_load. We choose the
                                                              load to make the
                                                              s_waitcnt be as late as
                                                              possible so that the
                                                              store may have already
                                                              completed.)

                                                         2. *Following instructions
                                                            same as corresponding
                                                            load atomic acquire,
                                                            except must generate all
                                                            instructions even for
                                                            OpenCL.*
     load atomic  seq_cst      - workgroup    - local    *Same as corresponding
                                                         load atomic acquire,
                                                         except must generate
                                                         all instructions even
                                                         for OpenCL.*

     load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
                               - system       - generic     vmcnt(0)

                                                            - Could be split into
                                                              separate s_waitcnt
                                                              vmcnt(0) and s_waitcnt
                                                              lgkmcnt(0) to allow
                                                              them to be
                                                              independently moved
                                                              according to the
                                                              following rules.
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after
                                                              preceding
                                                              global/generic load
                                                              atomic/store
                                                              atomic/atomicrmw with
                                                              memory ordering of
                                                              seq_cst and with equal
                                                              or wider sync scope.
                                                              (Note that seq_cst
                                                              fences have their own
                                                              s_waitcnt lgkmcnt(0)
                                                              and so do not need to
                                                              be considered.)
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              preceding
                                                              global/generic load
                                                              atomic/store
                                                              atomic/atomicrmw with
                                                              memory ordering of
                                                              seq_cst and with equal
                                                              or wider sync scope.
                                                              (Note that seq_cst
                                                              fences have their own
                                                              s_waitcnt vmcnt(0) and
                                                              so do not need to be
                                                              considered.)
                                                            - Ensures any preceding
                                                              sequential consistent
                                                              global memory
                                                              instructions have
                                                              completed before
                                                              executing this
                                                              sequentially consistent
                                                              instruction. This
                                                              prevents reordering a
                                                              seq_cst store followed
                                                              by a seq_cst load.
                                                              (Note that seq_cst is
                                                              stronger than
                                                              acquire/release as the
                                                              reordering of load
                                                              acquire followed by a
                                                              store release is
                                                              prevented by the
                                                              s_waitcnt of the
                                                              release, but there is
                                                              nothing preventing a
                                                              store release followed
                                                              by load acquire from
                                                              completing out of
                                                              order. The s_waitcnt
                                                              could be placed after
                                                              seq_store or before the
                                                              seq_load. We choose the
                                                              load to make the
                                                              s_waitcnt be as late as
                                                              possible so that the
                                                              store may have already
                                                              completed.)

                                                         2. *Following instructions
                                                            same as corresponding
                                                            load atomic acquire,
                                                            except must generate all
                                                            instructions even for
                                                            OpenCL.*
     store atomic seq_cst      - singlethread - global   *Same as corresponding
                               - wavefront    - local    store atomic release,
                               - workgroup    - generic  except must generate
                               - agent                   all instructions even
                               - system                  for OpenCL.*
     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
                               - wavefront    - local    atomicrmw acq_rel,
                               - workgroup    - generic  except must generate
                               - agent                   all instructions even
                               - system                  for OpenCL.*
     fence        seq_cst      - singlethread *none*     *Same as corresponding
                               - wavefront               fence acq_rel,
                               - workgroup               except must generate
                               - agent                   all instructions even
                               - system                  for OpenCL.*
     ============ ============ ============== ========== ================================

.. _amdgpu-amdhsa-memory-model-gfx90a:

Memory Model GFX90A
+++++++++++++++++++

For GFX90A:

* Each agent has multiple shader arrays (SA).
* Each SA has multiple compute units (CU).
* Each CU has multiple SIMDs that execute wavefronts.
* The wavefronts for a single work-group are executed in the same CU but may be
  executed by different SIMDs. The exception is when in tgsplit execution mode
  when the wavefronts may be executed by different SIMDs in different CUs.
* Each CU has a single LDS memory shared by the wavefronts of the work-groups
  executing on it. The exception is when in tgsplit execution mode when no LDS
  is allocated as wavefronts of the same work-group can be in different CUs.
* All LDS operations of a CU are performed as wavefront wide operations in a
  global order and involve no caching. Completion is reported to a wavefront in
  execution order.
* The LDS memory has multiple request queues shared by the SIMDs of a
  CU.
  Therefore, the LDS operations performed by different wavefronts of a
  work-group can be reordered relative to each other, which can result in
  reordering the visibility of vector memory operations with respect to LDS
  operations of other wavefronts in the same work-group. A ``s_waitcnt
  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
  vector memory operations between wavefronts of a work-group, but not between
  operations performed by the same wavefront.
* The vector memory operations are performed as wavefront wide operations and
  completion is reported to a wavefront in execution order. The exception is
  that ``flat_load/store/atomic`` instructions can report out of vector memory
  order if they access LDS memory, and out of LDS operation order if they
  access global memory.
* The vector memory operations access a single vector L1 cache shared by all
  SIMDs of a CU. Therefore:

  * No special action is required for coherence between the lanes of a single
    wavefront.

  * No special action is required for coherence between wavefronts in the same
    work-group since they execute on the same CU. The exception is when in
    tgsplit execution mode as wavefronts of the same work-group can be in
    different CUs and so a ``buffer_wbinvl1_vol`` is required as described in
    the following item.

  * A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
    executing in different work-groups as they may be executing on different
    CUs.

* The scalar memory operations access a scalar L1 cache shared by all
  wavefronts on a group of CUs. The scalar and vector L1 caches are not
  coherent. However, scalar operations are used in a restricted way so do not
  impact the memory model. See :ref:`amdgpu-amdhsa-memory-spaces`.
* The vector and scalar memory operations use an L2 cache shared by all CUs on
  the same agent.

  * The L2 cache has independent channels to service disjoint ranges of
    virtual addresses.
  * Each CU has a separate request queue per channel. Therefore, the vector
    and scalar memory operations performed by wavefronts executing in
    different work-groups (which may be executing on different CUs), or the
    same work-group if executing in tgsplit mode, of an agent can be reordered
    relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
    synchronization between vector memory operations of different CUs. It
    ensures a previous vector memory operation has completed before executing
    a subsequent vector memory or LDS operation and so can be used to meet the
    requirements of acquire and release.
  * The L2 cache of one agent can be kept coherent with other agents by:
    using the MTYPE RW (read-write) or MTYPE CC (cache-coherent) with the PTE
    C-bit for memory local to the L2; and using the MTYPE NC (non-coherent)
    with the PTE C-bit set or MTYPE UC (uncached) for memory not local to the
    L2.

    * Any local memory cache lines will be automatically invalidated by
      writes from CUs associated with other L2 caches, or writes from the
      CPU, due to the cache probe caused by coherent requests. Coherent
      requests are caused by GPU accesses to pages with the PTE C-bit set, by
      CPU accesses over XGMI, and by PCIe requests that are configured to be
      coherent requests.
    * XGMI accesses from the CPU to local memory may be cached on the CPU.
      Subsequent access from the GPU will automatically invalidate or
      writeback the CPU cache due to the L2 probe filter and the PTE C-bit
      being set.
    * Since all work-groups on the same agent share the same L2, no L2
      invalidation or writeback is required for coherence.
    * To ensure coherence of local and remote memory writes of work-groups in
      different agents a ``buffer_wbl2`` is required. It will writeback dirty
      L2 cache lines of MTYPE RW (used for local coarse grain memory) and
      MTYPE NC (used for remote coarse grain memory). Note that MTYPE CC
      (used for local fine grain memory) causes write through to DRAM, and
      MTYPE UC (used for remote fine grain memory) bypasses the L2, so
      neither will ever result in dirty L2 cache lines.
    * To ensure coherence of local and remote memory reads of work-groups in
      different agents a ``buffer_invl2`` is required. It will invalidate L2
      cache lines with MTYPE NC (used for remote coarse grain memory). Note
      that MTYPE CC (used for local fine grain memory) and MTYPE RW (used for
      local coarse grain memory) cause local reads to be invalidated by
      remote writes with the PTE C-bit set, so these cache lines are not
      invalidated. Note that MTYPE UC (used for remote fine grain memory)
      bypasses the L2, so will never result in L2 cache lines that need to be
      invalidated.

  * PCIe access from the GPU to the CPU memory is kept coherent by using the
    MTYPE UC (uncached) which bypasses the L2.

Scalar memory operations are only used to access memory that is proven to not
change during the execution of the kernel dispatch. This includes constant
address space and global address space for program scope ``const`` variables.
Therefore, the kernel machine code does not have to maintain the scalar cache
to ensure it is coherent with the vector caches. The scalar and vector caches
are invalidated between kernel dispatches by CP since constant address space
data may change between kernel dispatch executions. See
:ref:`amdgpu-amdhsa-memory-spaces`.

The one exception is if scalar writes are used to spill SGPR registers. In
this case the AMDGPU backend ensures the memory location used to spill is
never accessed by vector memory operations at the same time.
If scalar writes are used
then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a
function return since the locations may be used for vector memory instructions
by a future wavefront that uses the same scratch area, or a function call that
creates a frame at the same address, respectively. There is no need for a
``s_dcache_inv`` as all scalar writes are write-before-read in the same
thread.

For kernarg backing memory:

* CP invalidates the L1 cache at the start of each kernel dispatch.
* On dGPU over XGMI or PCIe the kernarg backing memory is allocated in host
  memory accessed as MTYPE UC (uncached) to avoid needing to invalidate the
  L2 cache. This also causes it to be treated as non-volatile and so is not
  invalidated by ``*_vol``.
* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent)
  and so the L2 cache will be coherent with the CPU and other agents.

Scratch backing memory (which is used for the private address space) is
accessed with MTYPE NC_NV (non-coherent non-volatile). Since the private
address space is only accessed by a single thread, and is always
write-before-read, there is never a need to invalidate these entries from the
L1 cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate
the volatile cache lines.

The code sequences used to implement the memory model for GFX90A are defined
in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.

  .. table:: AMDHSA Memory Model Code Sequences GFX90A
     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table

     ============ ============ ============== ========== ================================
     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
                  Ordering     Sync Scope     Address    GFX90A
                                              Space
     ============ ============ ============== ========== ================================
     **Non-Atomic**
     ------------------------------------------------------------------------------------
     load         *none*       *none*         - global   - !volatile & !nontemporal
                                              - generic
                                              - private  1. buffer/global/flat_load
                                              - constant
                                                         - !volatile & nontemporal

                                                         1. buffer/global/flat_load
                                                            glc=1 slc=1

                                                         - volatile

                                                         1. buffer/global/flat_load
                                                            glc=1
                                                         2. s_waitcnt vmcnt(0)

                                                            - Must happen before any
                                                              following volatile
                                                              global/generic
                                                              load/store.
                                                            - Ensures that volatile
                                                              operations to
                                                              different addresses
                                                              will not be reordered
                                                              by hardware.

     load         *none*       *none*         - local    1. ds_load
     store        *none*       *none*         - global   - !volatile & !nontemporal
                                              - generic
                                              - private  1. buffer/global/flat_store
                                              - constant
                                                         - !volatile & nontemporal

                                                         1. buffer/global/flat_store
                                                            glc=1 slc=1

                                                         - volatile

                                                         1. buffer/global/flat_store
                                                         2. s_waitcnt vmcnt(0)

                                                            - Must happen before any
                                                              following volatile
                                                              global/generic
                                                              load/store.
                                                            - Ensures that volatile
                                                              operations to
                                                              different addresses
                                                              will not be reordered
                                                              by hardware.

     store        *none*       *none*         - local    1. ds_store
     **Unordered Atomic**
     ------------------------------------------------------------------------------------
     load atomic  unordered    *any*          *any*      *Same as non-atomic*.
     store atomic unordered    *any*          *any*      *Same as non-atomic*.
     atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
     **Monotonic Atomic**
     ------------------------------------------------------------------------------------
     load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
                               - wavefront    - generic
     load atomic  monotonic    - workgroup    - global   1. buffer/global/flat_load
                                              - generic     glc=1

                                                            - If not TgSplit execution
                                                              mode, omit glc=1.

     load atomic  monotonic    - singlethread - local    *If TgSplit execution mode,
                               - wavefront               local address space cannot
                               - workgroup               be used.*

                                                         1. ds_load
     load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
                                              - generic     glc=1
     load atomic  monotonic    - system       - global   1. buffer/global/flat_load
                                              - generic     glc=1
     store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
                               - wavefront    - generic
                               - workgroup
                               - agent
     store atomic monotonic    - system       - global   1. buffer/global/flat_store
                                              - generic
     store atomic monotonic    - singlethread - local    *If TgSplit execution mode,
                               - wavefront               local address space cannot
                               - workgroup               be used.*

                                                         1. ds_store
     atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
                               - wavefront    - generic
                               - workgroup
                               - agent
     atomicrmw    monotonic    - system       - global   1. buffer/global/flat_atomic
                                              - generic
     atomicrmw    monotonic    - singlethread - local    *If TgSplit execution mode,
                               - wavefront               local address space cannot
                               - workgroup               be used.*

                                                         1. ds_atomic
     **Acquire Atomic**
     ------------------------------------------------------------------------------------
     load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
                               - wavefront    - local
                                              - generic
     load atomic  acquire      - workgroup    - global   1. buffer/global_load glc=1

                                                            - If not TgSplit execution
                                                              mode, omit glc=1.

                                                         2. s_waitcnt vmcnt(0)

                                                            - If not TgSplit execution
                                                              mode, omit.
                                                            - Must happen before the
                                                              following
                                                              buffer_wbinvl1_vol.

                                                         3. buffer_wbinvl1_vol

                                                            - If not TgSplit execution
                                                              mode, omit.
                                                            - Must happen before any
                                                              following
                                                              global/generic
                                                              load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures that following
                                                              loads will not see
                                                              stale data.

     load atomic  acquire      - workgroup    - local    *If TgSplit execution mode,
                                                         local address space cannot
                                                         be used.*

                                                         1. ds_load
                                                         2. s_waitcnt lgkmcnt(0)

                                                            - If OpenCL, omit.
                                                            - Must happen before any
                                                              following
                                                              global/generic
                                                              load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures any following
                                                              global data read is no
                                                              older than the local
                                                              load atomic value
                                                              being acquired.

     load atomic  acquire      - workgroup    - generic  1. flat_load glc=1

                                                            - If not TgSplit execution
                                                              mode, omit glc=1.

                                                         2. s_waitcnt lgkm/vmcnt(0)

                                                            - Use lgkmcnt(0) if not
                                                              TgSplit execution mode
                                                              and vmcnt(0) if
                                                              TgSplit execution
                                                              mode.
                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Must happen before the
                                                              following
                                                              buffer_wbinvl1_vol and
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures any following
                                                              global data read is no
                                                              older than a local
                                                              load atomic value
                                                              being acquired.

                                                         3. buffer_wbinvl1_vol

                                                            - If not TgSplit execution
                                                              mode, omit.
                                                            - Ensures that following
                                                              loads will not see
                                                              stale data.

     load atomic  acquire      - agent        - global   1. buffer/global_load
                                                            glc=1
                                                         2. s_waitcnt vmcnt(0)

                                                            - Must happen before
                                                              following
                                                              buffer_wbinvl1_vol.
                                                            - Ensures the load has
                                                              completed before
                                                              invalidating the
                                                              cache.

                                                         3. buffer_wbinvl1_vol

                                                            - Must happen before any
                                                              following
                                                              global/generic
                                                              load/load
                                                              atomic/atomicrmw.
                                                            - Ensures that following
                                                              loads will not see
                                                              stale global data.

     load atomic  acquire      - system       - global   1. buffer/global/flat_load
                                                            glc=1
                                                         2. s_waitcnt vmcnt(0)

                                                            - Must happen before
                                                              following buffer_invl2
                                                              and
                                                              buffer_wbinvl1_vol.
                                                            - Ensures the load has
                                                              completed before
                                                              invalidating the
                                                              cache.

                                                         3. buffer_invl2;
                                                            buffer_wbinvl1_vol

                                                            - Must happen before any
                                                              following
                                                              global/generic
                                                              load/load
                                                              atomic/atomicrmw.
                                                            - Ensures that following
                                                              loads will not see
                                                              stale L1 global data,
                                                              nor see stale L2 MTYPE
                                                              NC global data. MTYPE
                                                              RW and CC memory will
                                                              never be stale in L2
                                                              due to the memory
                                                              probes.

     load atomic  acquire      - agent        - generic  1. flat_load glc=1
                                                         2. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)

                                                            - If TgSplit execution
                                                              mode, omit lgkmcnt(0).
                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Must happen before
                                                              following
                                                              buffer_wbinvl1_vol.
                                                            - Ensures the flat_load
                                                              has completed before
                                                              invalidating the
                                                              cache.

                                                         3. buffer_wbinvl1_vol

                                                            - Must happen before any
                                                              following
                                                              global/generic
                                                              load/load
                                                              atomic/atomicrmw.
                                                            - Ensures that following
                                                              loads will not see
                                                              stale global data.

     load atomic  acquire      - system       - generic  1. flat_load glc=1
                                                         2. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)

                                                            - If TgSplit execution
                                                              mode, omit lgkmcnt(0).
                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Must happen before
                                                              following buffer_invl2
                                                              and
                                                              buffer_wbinvl1_vol.
                                                            - Ensures the flat_load
                                                              has completed before
                                                              invalidating the
                                                              caches.

                                                         3. buffer_invl2;
                                                            buffer_wbinvl1_vol

                                                            - Must happen before any
                                                              following
                                                              global/generic
                                                              load/load
                                                              atomic/atomicrmw.
                                                            - Ensures that following
                                                              loads will not see
                                                              stale L1 global data,
                                                              nor see stale L2 MTYPE
                                                              NC global data. MTYPE
                                                              RW and CC memory will
                                                              never be stale in L2
                                                              due to the memory
                                                              probes.

     atomicrmw    acquire      - singlethread - global   1. buffer/global/flat_atomic
                               - wavefront    - generic
     atomicrmw    acquire      - singlethread - local    *If TgSplit execution mode,
                               - wavefront               local address space cannot
                                                         be used.*

                                                         1. ds_atomic
     atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
                                                         2. s_waitcnt vmcnt(0)

                                                            - If not TgSplit execution
                                                              mode, omit.
                                                            - Must happen before the
                                                              following
                                                              buffer_wbinvl1_vol.
                                                            - Ensures the atomicrmw
                                                              has completed before
                                                              invalidating the
                                                              cache.

                                                         3. buffer_wbinvl1_vol

                                                            - If not TgSplit execution
                                                              mode, omit.
                                                            - Must happen before any
                                                              following
                                                              global/generic
                                                              load/load
                                                              atomic/atomicrmw.
                                                            - Ensures that following
                                                              loads will not see
                                                              stale global data.

     atomicrmw    acquire      - workgroup    - local    *If TgSplit execution mode,
                                                         local address space cannot
                                                         be used.*

                                                         1. ds_atomic
                                                         2. s_waitcnt lgkmcnt(0)

                                                            - If OpenCL, omit.
                                                            - Must happen before any
                                                              following
                                                              global/generic
                                                              load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures any following
                                                              global data read is no
                                                              older than the local
                                                              atomicrmw value being
                                                              acquired.

     atomicrmw    acquire      - workgroup    - generic  1. flat_atomic
                                                         2. s_waitcnt lgkm/vmcnt(0)

                                                            - Use lgkmcnt(0) if not
                                                              TgSplit execution mode
                                                              and vmcnt(0) if
                                                              TgSplit execution
                                                              mode.
                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Must happen before the
                                                              following
                                                              buffer_wbinvl1_vol and
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures any following
                                                              global data read is no
                                                              older than a local
                                                              atomicrmw value being
                                                              acquired.

                                                         3. buffer_wbinvl1_vol

                                                            - If not TgSplit execution
                                                              mode, omit.
                                                            - Ensures that following
                                                              loads will not see
                                                              stale data.

     atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
                                                         2. s_waitcnt vmcnt(0)

                                                            - Must happen before
                                                              following
                                                              buffer_wbinvl1_vol.
                                                            - Ensures the atomicrmw
                                                              has completed before
                                                              invalidating the
                                                              cache.

                                                         3. buffer_wbinvl1_vol

                                                            - Must happen before any
                                                              following
                                                              global/generic
                                                              load/load
                                                              atomic/atomicrmw.
                                                            - Ensures that following
                                                              loads will not see
                                                              stale global data.

     atomicrmw    acquire      - system       - global   1. buffer/global_atomic
                                                         2. s_waitcnt vmcnt(0)

                                                            - Must happen before
                                                              following buffer_invl2
                                                              and
                                                              buffer_wbinvl1_vol.
                                                            - Ensures the atomicrmw
                                                              has completed before
                                                              invalidating the
                                                              caches.

                                                         3. buffer_invl2;
                                                            buffer_wbinvl1_vol

                                                            - Must happen before any
                                                              following
                                                              global/generic
                                                              load/load
                                                              atomic/atomicrmw.
                                                            - Ensures that following
                                                              loads will not see
                                                              stale L1 global data,
                                                              nor see stale L2 MTYPE
                                                              NC global data. MTYPE
                                                              RW and CC memory will
                                                              never be stale in L2
                                                              due to the memory
                                                              probes.

     atomicrmw    acquire      - agent        - generic  1. flat_atomic
                                                         2. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)

                                                            - If TgSplit execution
                                                              mode, omit lgkmcnt(0).
                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Must happen before
                                                              following
                                                              buffer_wbinvl1_vol.
                                                            - Ensures the atomicrmw
                                                              has completed before
                                                              invalidating the
                                                              cache.

                                                         3. buffer_wbinvl1_vol

                                                            - Must happen before any
                                                              following
                                                              global/generic
                                                              load/load
                                                              atomic/atomicrmw.
                                                            - Ensures that following
                                                              loads will not see
                                                              stale global data.

     atomicrmw    acquire      - system       - generic  1. flat_atomic
                                                         2. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)

                                                            - If TgSplit execution
                                                              mode, omit lgkmcnt(0).
                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Must happen before
                                                              following buffer_invl2
                                                              and
                                                              buffer_wbinvl1_vol.
                                                            - Ensures the atomicrmw
                                                              has completed before
                                                              invalidating the
                                                              caches.

                                                         3. buffer_invl2;
                                                            buffer_wbinvl1_vol

                                                            - Must happen before any
                                                              following
                                                              global/generic
                                                              load/load
                                                              atomic/atomicrmw.
                                                            - Ensures that following
                                                              loads will not see
                                                              stale L1 global data,
                                                              nor see stale L2 MTYPE
                                                              NC global data. MTYPE
                                                              RW and CC memory will
                                                              never be stale in L2
                                                              due to the memory
                                                              probes.

     fence        acquire      - singlethread *none*     *none*
                               - wavefront
     fence        acquire      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)

                                                            - Use lgkmcnt(0) if not
                                                              TgSplit execution mode
                                                              and vmcnt(0) if
                                                              TgSplit execution
                                                              mode.
                                                            - If OpenCL and address
                                                              space is not generic,
                                                              omit lgkmcnt(0).
                                                            - If OpenCL and address
                                                              space is local, omit
                                                              vmcnt(0).
                                                            - However, since LLVM
                                                              currently has no
                                                              address space on the
                                                              fence need to
                                                              conservatively always
                                                              generate. If fence had
                                                              an address space then
                                                              set to address space
                                                              of OpenCL fence flag,
                                                              or to generic if both
                                                              local and global flags
                                                              are specified.
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after any
                                                              preceding
                                                              global/generic load
                                                              atomic/atomicrmw with
                                                              an equal or wider sync
                                                              scope and memory
                                                              ordering stronger than
                                                              unordered (this is
                                                              termed the
                                                              fence-paired-atomic).
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after any
                                                              preceding
                                                              local/generic load
                                                              atomic/atomicrmw with
                                                              an equal or wider sync
                                                              scope and memory
                                                              ordering stronger than
                                                              unordered (this is
                                                              termed the
                                                              fence-paired-atomic).
                                                            - Must happen before the
                                                              following
                                                              buffer_wbinvl1_vol and
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures any following
                                                              global data read is no
                                                              older than the value
                                                              read by the
                                                              fence-paired-atomic.

                                                         2. buffer_wbinvl1_vol

                                                            - If not TgSplit execution
                                                              mode, omit.
                                                            - Ensures that following
                                                              loads will not see
                                                              stale data.

     fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                            - If TgSplit execution
                                                              mode, omit lgkmcnt(0).
                                                            - If OpenCL and address
                                                              space is not generic,
                                                              omit lgkmcnt(0).
                                                            - However, since LLVM
                                                              currently has no
                                                              address space on the
                                                              fence need to
                                                              conservatively always
                                                              generate (see comment
                                                              for previous fence).
                                                            - Could be split into
                                                              separate s_waitcnt
                                                              vmcnt(0) and s_waitcnt
                                                              lgkmcnt(0) to allow
                                                              them to be
                                                              independently moved
                                                              according to the
                                                              following rules.
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after any
                                                              preceding
                                                              global/generic load
                                                              atomic/atomicrmw with
                                                              an equal or wider sync
                                                              scope and memory
                                                              ordering stronger than
                                                              unordered (this is
                                                              termed the
                                                              fence-paired-atomic).
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after any
                                                              preceding
                                                              local/generic load
                                                              atomic/atomicrmw with
                                                              an equal or wider sync
                                                              scope and memory
                                                              ordering stronger than
                                                              unordered (this is
                                                              termed the
                                                              fence-paired-atomic).
                                                            - Must happen before the
                                                              following
                                                              buffer_wbinvl1_vol.
                                                            - Ensures that the
                                                              fence-paired atomic
                                                              has completed before
                                                              invalidating the
                                                              cache. Therefore any
                                                              following locations
                                                              read must be no older
                                                              than the value read by
                                                              the
                                                              fence-paired-atomic.

                                                         2. buffer_wbinvl1_vol

                                                            - Must happen before any
                                                              following
                                                              global/generic
                                                              load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures that following
                                                              loads will not see
                                                              stale global data.

     fence        acquire      - system       *none*     1. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                            - If TgSplit execution
                                                              mode, omit lgkmcnt(0).
                                                            - If OpenCL and address
                                                              space is not generic,
                                                              omit lgkmcnt(0).
                                                            - However, since LLVM
                                                              currently has no
                                                              address space on the
                                                              fence need to
                                                              conservatively always
                                                              generate (see comment
                                                              for previous fence).
                                                            - Could be split into
                                                              separate s_waitcnt
                                                              vmcnt(0) and s_waitcnt
                                                              lgkmcnt(0) to allow
                                                              them to be
                                                              independently moved
                                                              according to the
                                                              following rules.
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after any
                                                              preceding
                                                              global/generic load
                                                              atomic/atomicrmw with
                                                              an equal or wider sync
                                                              scope and memory
                                                              ordering stronger than
                                                              unordered (this is
                                                              termed the
                                                              fence-paired-atomic).
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after any
                                                              preceding
                                                              local/generic load
                                                              atomic/atomicrmw with
                                                              an equal or wider sync
                                                              scope and memory
                                                              ordering stronger than
                                                              unordered (this is
                                                              termed the
                                                              fence-paired-atomic).
                                                            - Must happen before the
                                                              following buffer_invl2
                                                              and
                                                              buffer_wbinvl1_vol.
                                                            - Ensures that the
                                                              fence-paired atomic
                                                              has completed before
                                                              invalidating the
                                                              cache. Therefore any
                                                              following locations
                                                              read must be no older
                                                              than the value read by
                                                              the
                                                              fence-paired-atomic.

                                                         2. buffer_invl2;
                                                            buffer_wbinvl1_vol

                                                            - Must happen before any
                                                              following
                                                              global/generic
                                                              load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures that following
                                                              loads will not see
                                                              stale L1 global data,
                                                              nor see stale L2 MTYPE
                                                              NC global data. MTYPE
                                                              RW and CC memory will
                                                              never be stale in L2
                                                              due to the memory
                                                              probes.

     **Release Atomic**
     ------------------------------------------------------------------------------------
     store atomic release      - singlethread - global   1. buffer/global/flat_store
                               - wavefront    - generic
     store atomic release      - singlethread - local    *If TgSplit execution mode,
                               - wavefront               local address space cannot
                                                         be used.*

                                                         1. ds_store
     store atomic release      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
                                              - generic
                                                            - Use lgkmcnt(0) if not
                                                              TgSplit execution mode
                                                              and vmcnt(0) if
                                                              TgSplit execution
                                                              mode.
                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after any
                                                              preceding
                                                              global/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after any
                                                              preceding
                                                              local/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before the
                                                              following store.
                                                            - Ensures that all
                                                              memory operations have
                                                              completed before
                                                              performing the store
                                                              that is being
                                                              released.

                                                         2. buffer/global/flat_store
     store atomic release      - workgroup    - local    *If TgSplit execution mode,
                                                         local address space cannot
                                                         be used.*

                                                         1. ds_store
     store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
                                              - generic     vmcnt(0)

                                                            - If TgSplit execution
                                                              mode, omit lgkmcnt(0).
                                                            - If OpenCL and address
                                                              space is not generic,
                                                              omit lgkmcnt(0).
                                                            - Could be split into
                                                              separate s_waitcnt
                                                              vmcnt(0) and s_waitcnt
                                                              lgkmcnt(0) to allow
                                                              them to be
                                                              independently moved
                                                              according to the
                                                              following rules.
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after any
                                                              preceding
                                                              global/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after any
                                                              preceding
                                                              local/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before the
                                                              following store.
                                                            - Ensures that all
                                                              memory operations to
                                                              memory have completed
                                                              before performing the
                                                              store that is being
                                                              released.

                                                         2. buffer/global/flat_store
     store atomic release      - system       - global   1. buffer_wbl2
                                              - generic
                                                            - Must happen before
                                                              following s_waitcnt.
                                                            - Performs L2 writeback
                                                              to ensure previous
                                                              global/generic
                                                              store/atomicrmw are
                                                              visible at system
                                                              scope.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                            - If TgSplit execution
                                                              mode, omit lgkmcnt(0).
                                                            - If OpenCL and address
                                                              space is not generic,
                                                              omit lgkmcnt(0).
                                                            - Could be split into
                                                              separate s_waitcnt
                                                              vmcnt(0) and s_waitcnt
                                                              lgkmcnt(0) to allow
                                                              them to be
                                                              independently moved
                                                              according to the
                                                              following rules.
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after any
                                                              preceding
                                                              global/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after any
                                                              preceding
                                                              local/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before the
                                                              following store.
                                                            - Ensures that all
                                                              memory operations to
                                                              memory and the L2
                                                              writeback have
                                                              completed before
                                                              performing the store
                                                              that is being
                                                              released.

                                                         3. buffer/global/flat_store
     atomicrmw    release      - singlethread - global   1. buffer/global/flat_atomic
                               - wavefront    - generic
     atomicrmw    release      - singlethread - local    *If TgSplit execution mode,
                               - wavefront               local address space cannot
                                                         be used.*

                                                         1. ds_atomic
     atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
                                              - generic
                                                            - Use lgkmcnt(0) if not
                                                              TgSplit execution mode
                                                              and vmcnt(0) if
                                                              TgSplit execution
                                                              mode.
                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after any
                                                              preceding
                                                              global/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after any
                                                              preceding
                                                              local/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before the
                                                              following atomicrmw.
                                                            - Ensures that all
                                                              memory operations have
                                                              completed before
                                                              performing the
                                                              atomicrmw that is
                                                              being released.

                                                         2. buffer/global/flat_atomic
     atomicrmw    release      - workgroup    - local    *If TgSplit execution mode,
                                                         local address space cannot
                                                         be used.*

                                                         1. ds_atomic
     atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
                                              - generic     vmcnt(0)

                                                            - If TgSplit execution
                                                              mode, omit lgkmcnt(0).
                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Could be split into
                                                              separate s_waitcnt
                                                              vmcnt(0) and s_waitcnt
                                                              lgkmcnt(0) to allow
                                                              them to be
                                                              independently moved
                                                              according to the
                                                              following rules.
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after any
                                                              preceding
                                                              global/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after any
                                                              preceding
                                                              local/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before the
                                                              following atomicrmw.
7490 - Ensures that all 7491 memory operations 7492 to global and local 7493 have completed 7494 before performing 7495 the atomicrmw that 7496 is being released. 7497 7498 2. buffer/global/flat_atomic 7499 atomicrmw release - system - global 1. buffer_wbl2 7500 - generic 7501 - Must happen before 7502 following s_waitcnt. 7503 - Performs L2 writeback to 7504 ensure previous 7505 global/generic 7506 store/atomicrmw are 7507 visible at system scope. 7508 7509 2. s_waitcnt lgkmcnt(0) & 7510 vmcnt(0) 7511 7512 - If TgSplit execution mode, 7513 omit lgkmcnt(0). 7514 - If OpenCL, omit 7515 lgkmcnt(0). 7516 - Could be split into 7517 separate s_waitcnt 7518 vmcnt(0) and 7519 s_waitcnt 7520 lgkmcnt(0) to allow 7521 them to be 7522 independently moved 7523 according to the 7524 following rules. 7525 - s_waitcnt vmcnt(0) 7526 must happen after 7527 any preceding 7528 global/generic 7529 load/store/load 7530 atomic/store 7531 atomic/atomicrmw. 7532 - s_waitcnt lgkmcnt(0) 7533 must happen after 7534 any preceding 7535 local/generic 7536 load/store/load 7537 atomic/store 7538 atomic/atomicrmw. 7539 - Must happen before 7540 the following 7541 atomicrmw. 7542 - Ensures that all 7543 memory operations 7544 to memory and the L2 7545 writeback have 7546 completed before 7547 performing the 7548 store that is being 7549 released. 7550 7551 3. buffer/global/flat_atomic 7552 fence release - singlethread *none* *none* 7553 - wavefront 7554 fence release - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0) 7555 7556 - Use lgkmcnt(0) if not 7557 TgSplit execution mode 7558 and vmcnt(0) if TgSplit 7559 execution mode. 7560 - If OpenCL and 7561 address space is 7562 not generic, omit 7563 lgkmcnt(0). 7564 - If OpenCL and 7565 address space is 7566 local, omit 7567 vmcnt(0). 7568 - However, since LLVM 7569 currently has no 7570 address space on 7571 the fence need to 7572 conservatively 7573 always generate. 
If 7574 fence had an 7575 address space then 7576 set to address 7577 space of OpenCL 7578 fence flag, or to 7579 generic if both 7580 local and global 7581 flags are 7582 specified. 7583 - s_waitcnt vmcnt(0) 7584 must happen after 7585 any preceding 7586 global/generic 7587 load/store/ 7588 load atomic/store atomic/ 7589 atomicrmw. 7590 - s_waitcnt lgkmcnt(0) 7591 must happen after 7592 any preceding 7593 local/generic 7594 load/load 7595 atomic/store/store 7596 atomic/atomicrmw. 7597 - Must happen before 7598 any following store 7599 atomic/atomicrmw 7600 with an equal or 7601 wider sync scope 7602 and memory ordering 7603 stronger than 7604 unordered (this is 7605 termed the 7606 fence-paired-atomic). 7607 - Ensures that all 7608 memory operations 7609 have 7610 completed before 7611 performing the 7612 following 7613 fence-paired-atomic. 7614 7615 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) & 7616 vmcnt(0) 7617 7618 - If TgSplit execution mode, 7619 omit lgkmcnt(0). 7620 - If OpenCL and 7621 address space is 7622 not generic, omit 7623 lgkmcnt(0). 7624 - If OpenCL and 7625 address space is 7626 local, omit 7627 vmcnt(0). 7628 - However, since LLVM 7629 currently has no 7630 address space on 7631 the fence need to 7632 conservatively 7633 always generate. If 7634 fence had an 7635 address space then 7636 set to address 7637 space of OpenCL 7638 fence flag, or to 7639 generic if both 7640 local and global 7641 flags are 7642 specified. 7643 - Could be split into 7644 separate s_waitcnt 7645 vmcnt(0) and 7646 s_waitcnt 7647 lgkmcnt(0) to allow 7648 them to be 7649 independently moved 7650 according to the 7651 following rules. 7652 - s_waitcnt vmcnt(0) 7653 must happen after 7654 any preceding 7655 global/generic 7656 load/store/load 7657 atomic/store 7658 atomic/atomicrmw. 7659 - s_waitcnt lgkmcnt(0) 7660 must happen after 7661 any preceding 7662 local/generic 7663 load/store/load 7664 atomic/store 7665 atomic/atomicrmw. 
7666 - Must happen before 7667 any following store 7668 atomic/atomicrmw 7669 with an equal or 7670 wider sync scope 7671 and memory ordering 7672 stronger than 7673 unordered (this is 7674 termed the 7675 fence-paired-atomic). 7676 - Ensures that all 7677 memory operations 7678 have 7679 completed before 7680 performing the 7681 following 7682 fence-paired-atomic. 7683 7684 fence release - system *none* 1. buffer_wbl2 7685 7686 - If OpenCL and 7687 address space is 7688 local, omit. 7689 - Must happen before 7690 following s_waitcnt. 7691 - Performs L2 writeback to 7692 ensure previous 7693 global/generic 7694 store/atomicrmw are 7695 visible at system scope. 7696 7697 2. s_waitcnt lgkmcnt(0) & 7698 vmcnt(0) 7699 7700 - If TgSplit execution mode, 7701 omit lgkmcnt(0). 7702 - If OpenCL and 7703 address space is 7704 not generic, omit 7705 lgkmcnt(0). 7706 - If OpenCL and 7707 address space is 7708 local, omit 7709 vmcnt(0). 7710 - However, since LLVM 7711 currently has no 7712 address space on 7713 the fence need to 7714 conservatively 7715 always generate. If 7716 fence had an 7717 address space then 7718 set to address 7719 space of OpenCL 7720 fence flag, or to 7721 generic if both 7722 local and global 7723 flags are 7724 specified. 7725 - Could be split into 7726 separate s_waitcnt 7727 vmcnt(0) and 7728 s_waitcnt 7729 lgkmcnt(0) to allow 7730 them to be 7731 independently moved 7732 according to the 7733 following rules. 7734 - s_waitcnt vmcnt(0) 7735 must happen after 7736 any preceding 7737 global/generic 7738 load/store/load 7739 atomic/store 7740 atomic/atomicrmw. 7741 - s_waitcnt lgkmcnt(0) 7742 must happen after 7743 any preceding 7744 local/generic 7745 load/store/load 7746 atomic/store 7747 atomic/atomicrmw. 7748 - Must happen before 7749 any following store 7750 atomic/atomicrmw 7751 with an equal or 7752 wider sync scope 7753 and memory ordering 7754 stronger than 7755 unordered (this is 7756 termed the 7757 fence-paired-atomic). 
                                                             - Ensures that all memory operations have
                                                               completed before performing the following
                                                               fence-paired-atomic.

     **Acquire-Release Atomic**
     ------------------------------------------------------------------------------------
     atomicrmw    acq_rel      - singlethread - global   1. buffer/global/flat_atomic
                               - wavefront    - generic
     atomicrmw    acq_rel      - singlethread - local    *If TgSplit execution mode, local address space
                               - wavefront               cannot be used.*

                                                          1. ds_atomic

     atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)

                                                             - Use lgkmcnt(0) if not TgSplit execution mode
                                                               and vmcnt(0) if TgSplit execution mode.
                                                             - If OpenCL, omit lgkmcnt(0).
                                                             - Must happen after any preceding local/generic
                                                               load/store/load atomic/store
                                                               atomic/atomicrmw.
                                                             - s_waitcnt vmcnt(0) must happen after any
                                                               preceding global/generic load/store/load
                                                               atomic/store atomic/atomicrmw.
                                                             - s_waitcnt lgkmcnt(0) must happen after any
                                                               preceding local/generic load/store/load
                                                               atomic/store atomic/atomicrmw.
                                                             - Must happen before the following atomicrmw.
                                                             - Ensures that all memory operations have
                                                               completed before performing the atomicrmw
                                                               that is being released.

                                                          2. buffer/global_atomic
                                                          3. s_waitcnt vmcnt(0)

                                                             - If not TgSplit execution mode, omit.
                                                             - Must happen before the following
                                                               buffer_wbinvl1_vol.
                                                             - Ensures any following global data read is no
                                                               older than the atomicrmw value being
                                                               acquired.

                                                          4. buffer_wbinvl1_vol

                                                             - If not TgSplit execution mode, omit.
                                                             - Ensures that following loads will not see
                                                               stale data.

     atomicrmw    acq_rel      - workgroup    - local    *If TgSplit execution mode, local address space
                                                         cannot be used.*

                                                          1. ds_atomic
                                                          2. s_waitcnt lgkmcnt(0)

                                                             - If OpenCL, omit.
                                                             - Must happen before any following
                                                               global/generic load/load atomic/store/store
                                                               atomic/atomicrmw.
                                                             - Ensures any following global data read is no
                                                               older than the local atomicrmw value being
                                                               acquired.

     atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkm/vmcnt(0)

                                                             - Use lgkmcnt(0) if not TgSplit execution mode
                                                               and vmcnt(0) if TgSplit execution mode.
                                                             - If OpenCL, omit lgkmcnt(0).
                                                             - s_waitcnt vmcnt(0) must happen after any
                                                               preceding global/generic load/store/load
                                                               atomic/store atomic/atomicrmw.
                                                             - s_waitcnt lgkmcnt(0) must happen after any
                                                               preceding local/generic load/store/load
                                                               atomic/store atomic/atomicrmw.
                                                             - Must happen before the following atomicrmw.
                                                             - Ensures that all memory operations have
                                                               completed before performing the atomicrmw
                                                               that is being released.

                                                          2. flat_atomic
                                                          3. s_waitcnt lgkmcnt(0) & vmcnt(0)

                                                             - If not TgSplit execution mode, omit vmcnt(0).
                                                             - If OpenCL, omit lgkmcnt(0).
                                                             - Must happen before the following
                                                               buffer_wbinvl1_vol and any following
                                                               global/generic load/load atomic/store/store
                                                               atomic/atomicrmw.
                                                             - Ensures any following global data read is no
                                                               older than a local load atomic value being
                                                               acquired.

                                                          4. buffer_wbinvl1_vol

                                                             - If not TgSplit execution mode, omit.
                                                             - Ensures that following loads will not see
                                                               stale data.

     atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) & vmcnt(0)

                                                             - If TgSplit execution mode, omit lgkmcnt(0).
                                                             - If OpenCL, omit lgkmcnt(0).
                                                             - Could be split into separate s_waitcnt vmcnt(0)
                                                               and s_waitcnt lgkmcnt(0) to allow them to be
                                                               independently moved according to the following
                                                               rules.
                                                             - s_waitcnt vmcnt(0) must happen after any
                                                               preceding global/generic load/store/load
                                                               atomic/store atomic/atomicrmw.
                                                             - s_waitcnt lgkmcnt(0) must happen after any
                                                               preceding local/generic load/store/load
                                                               atomic/store atomic/atomicrmw.
                                                             - Must happen before the following atomicrmw.
                                                             - Ensures that all memory operations to global
                                                               have completed before performing the
                                                               atomicrmw that is being released.

                                                          2. buffer/global_atomic
                                                          3. s_waitcnt vmcnt(0)

                                                             - Must happen before following
                                                               buffer_wbinvl1_vol.
                                                             - Ensures the atomicrmw has completed before
                                                               invalidating the cache.

                                                          4. buffer_wbinvl1_vol

                                                             - Must happen before any following
                                                               global/generic load/load atomic/atomicrmw.
                                                             - Ensures that following loads will not see
                                                               stale global data.

     atomicrmw    acq_rel      - system       - global   1. buffer_wbl2

                                                             - Must happen before following s_waitcnt.
                                                             - Performs L2 writeback to ensure previous
                                                               global/generic store/atomicrmw are visible at
                                                               system scope.

                                                          2. s_waitcnt lgkmcnt(0) & vmcnt(0)

                                                             - If TgSplit execution mode, omit lgkmcnt(0).
                                                             - If OpenCL, omit lgkmcnt(0).
                                                             - Could be split into separate s_waitcnt vmcnt(0)
                                                               and s_waitcnt lgkmcnt(0) to allow them to be
                                                               independently moved according to the following
                                                               rules.
                                                             - s_waitcnt vmcnt(0) must happen after any
                                                               preceding global/generic load/store/load
                                                               atomic/store atomic/atomicrmw.
                                                             - s_waitcnt lgkmcnt(0) must happen after any
                                                               preceding local/generic load/store/load
                                                               atomic/store atomic/atomicrmw.
                                                             - Must happen before the following atomicrmw.
                                                             - Ensures that all memory operations to global
                                                               and the L2 writeback have completed before
                                                               performing the atomicrmw that is being
                                                               released.

                                                          3. buffer/global_atomic
                                                          4. s_waitcnt vmcnt(0)

                                                             - Must happen before following buffer_invl2 and
                                                               buffer_wbinvl1_vol.
                                                             - Ensures the atomicrmw has completed before
                                                               invalidating the caches.

                                                          5. buffer_invl2;
                                                             buffer_wbinvl1_vol

                                                             - Must happen before any following
                                                               global/generic load/load atomic/atomicrmw.
                                                             - Ensures that following loads will not see
                                                               stale L1 global data, nor see stale L2 MTYPE
                                                               NC global data. MTYPE RW and CC memory will
                                                               never be stale in L2 due to the memory
                                                               probes.

     atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) & vmcnt(0)

                                                             - If TgSplit execution mode, omit lgkmcnt(0).
                                                             - If OpenCL, omit lgkmcnt(0).
                                                             - Could be split into separate s_waitcnt vmcnt(0)
                                                               and s_waitcnt lgkmcnt(0) to allow them to be
                                                               independently moved according to the following
                                                               rules.
                                                             - s_waitcnt vmcnt(0) must happen after any
                                                               preceding global/generic load/store/load
                                                               atomic/store atomic/atomicrmw.
                                                             - s_waitcnt lgkmcnt(0) must happen after any
                                                               preceding local/generic load/store/load
                                                               atomic/store atomic/atomicrmw.
                                                             - Must happen before the following atomicrmw.
                                                             - Ensures that all memory operations to global
                                                               have completed before performing the
                                                               atomicrmw that is being released.

                                                          2. flat_atomic
                                                          3. s_waitcnt vmcnt(0) & lgkmcnt(0)

                                                             - If TgSplit execution mode, omit lgkmcnt(0).
                                                             - If OpenCL, omit lgkmcnt(0).
                                                             - Must happen before following
                                                               buffer_wbinvl1_vol.
                                                             - Ensures the atomicrmw has completed before
                                                               invalidating the cache.

                                                          4. buffer_wbinvl1_vol

                                                             - Must happen before any following
                                                               global/generic load/load atomic/atomicrmw.
                                                             - Ensures that following loads will not see
                                                               stale global data.

     atomicrmw    acq_rel      - system       - generic  1. buffer_wbl2

                                                             - Must happen before following s_waitcnt.
                                                             - Performs L2 writeback to ensure previous
                                                               global/generic store/atomicrmw are visible at
                                                               system scope.

                                                          2. s_waitcnt lgkmcnt(0) & vmcnt(0)

                                                             - If TgSplit execution mode, omit lgkmcnt(0).
                                                             - If OpenCL, omit lgkmcnt(0).
                                                             - Could be split into separate s_waitcnt vmcnt(0)
                                                               and s_waitcnt lgkmcnt(0) to allow them to be
                                                               independently moved according to the following
                                                               rules.
                                                             - s_waitcnt vmcnt(0) must happen after any
                                                               preceding global/generic load/store/load
                                                               atomic/store atomic/atomicrmw.
                                                             - s_waitcnt lgkmcnt(0) must happen after any
                                                               preceding local/generic load/store/load
                                                               atomic/store atomic/atomicrmw.
                                                             - Must happen before the following atomicrmw.
                                                             - Ensures that all memory operations to global
                                                               and the L2 writeback have completed before
                                                               performing the atomicrmw that is being
                                                               released.

                                                          3. flat_atomic
                                                          4. s_waitcnt vmcnt(0) & lgkmcnt(0)

                                                             - If TgSplit execution mode, omit lgkmcnt(0).
                                                             - If OpenCL, omit lgkmcnt(0).
                                                             - Must happen before following buffer_invl2 and
                                                               buffer_wbinvl1_vol.
                                                             - Ensures the atomicrmw has completed before
                                                               invalidating the caches.
                                                          5. buffer_invl2;
                                                             buffer_wbinvl1_vol

                                                             - Must happen before any following
                                                               global/generic load/load atomic/atomicrmw.
                                                             - Ensures that following loads will not see
                                                               stale L1 global data, nor see stale L2 MTYPE
                                                               NC global data. MTYPE RW and CC memory will
                                                               never be stale in L2 due to the memory
                                                               probes.

     fence        acq_rel      - singlethread *none*     *none*
                               - wavefront
     fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)

                                                             - Use lgkmcnt(0) if not TgSplit execution mode
                                                               and vmcnt(0) if TgSplit execution mode.
                                                             - If OpenCL and address space is not generic,
                                                               omit lgkmcnt(0).
                                                             - If OpenCL and address space is local, omit
                                                               vmcnt(0).
                                                             - However, since LLVM currently has no address
                                                               space on the fence, it must conservatively
                                                               always be generated (see comment for previous
                                                               fence).
                                                             - s_waitcnt vmcnt(0) must happen after any
                                                               preceding global/generic load/store/load
                                                               atomic/store atomic/atomicrmw.
                                                             - s_waitcnt lgkmcnt(0) must happen after any
                                                               preceding local/generic load/load
                                                               atomic/store/store atomic/atomicrmw.
                                                             - Must happen before any following
                                                               global/generic load/load atomic/store/store
                                                               atomic/atomicrmw.
                                                             - Ensures that all memory operations have
                                                               completed before performing any following
                                                               global memory operations.
                                                             - Ensures that the preceding local/generic load
                                                               atomic/atomicrmw with an equal or wider sync
                                                               scope and memory ordering stronger than
                                                               unordered (this is termed the
                                                               acquire-fence-paired-atomic) has completed
                                                               before following global memory operations.
                                                               This satisfies the requirements of acquire.
                                                             - Ensures that all previous memory operations
                                                               have completed before a following
                                                               local/generic store atomic/atomicrmw with an
                                                               equal or wider sync scope and memory ordering
                                                               stronger than unordered (this is termed the
                                                               release-fence-paired-atomic). This satisfies
                                                               the requirements of release.
                                                             - Must happen before the following
                                                               buffer_wbinvl1_vol.
                                                             - Ensures that the acquire-fence-paired atomic
                                                               has completed before invalidating the cache.
                                                               Therefore any following locations read must
                                                               be no older than the value read by the
                                                               acquire-fence-paired-atomic.

                                                          2. buffer_wbinvl1_vol

                                                             - If not TgSplit execution mode, omit.
                                                             - Ensures that following loads will not see
                                                               stale data.

     fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) & vmcnt(0)

                                                             - If TgSplit execution mode, omit lgkmcnt(0).
                                                             - If OpenCL and address space is not generic,
                                                               omit lgkmcnt(0).
                                                             - However, since LLVM currently has no address
                                                               space on the fence, it must conservatively
                                                               always be generated (see comment for previous
                                                               fence).
                                                             - Could be split into separate s_waitcnt vmcnt(0)
                                                               and s_waitcnt lgkmcnt(0) to allow them to be
                                                               independently moved according to the following
                                                               rules.
                                                             - s_waitcnt vmcnt(0) must happen after any
                                                               preceding global/generic load/store/load
                                                               atomic/store atomic/atomicrmw.
                                                             - s_waitcnt lgkmcnt(0) must happen after any
                                                               preceding local/generic load/store/load
                                                               atomic/store atomic/atomicrmw.
                                                             - Must happen before the following
                                                               buffer_wbinvl1_vol.
                                                             - Ensures that the preceding
                                                               global/local/generic load atomic/atomicrmw
                                                               with an equal or wider sync scope and memory
                                                               ordering stronger than unordered (this is
                                                               termed the acquire-fence-paired-atomic) has
                                                               completed before invalidating the cache. This
                                                               satisfies the requirements of acquire.
                                                             - Ensures that all previous memory operations
                                                               have completed before a following
                                                               global/local/generic store atomic/atomicrmw
                                                               with an equal or wider sync scope and memory
                                                               ordering stronger than unordered (this is
                                                               termed the release-fence-paired-atomic).
                                                               This satisfies the requirements of release.

                                                          2. buffer_wbinvl1_vol

                                                             - Must happen before any following
                                                               global/generic load/load atomic/store/store
                                                               atomic/atomicrmw.
                                                             - Ensures that following loads will not see
                                                               stale global data. This satisfies the
                                                               requirements of acquire.

     fence        acq_rel      - system       *none*     1. buffer_wbl2

                                                             - If OpenCL and address space is local, omit.
                                                             - Must happen before following s_waitcnt.
                                                             - Performs L2 writeback to ensure previous
                                                               global/generic store/atomicrmw are visible at
                                                               system scope.

                                                          2. s_waitcnt lgkmcnt(0) & vmcnt(0)

                                                             - If TgSplit execution mode, omit lgkmcnt(0).
                                                             - If OpenCL and address space is not generic,
                                                               omit lgkmcnt(0).
                                                             - However, since LLVM currently has no address
                                                               space on the fence, it must conservatively
                                                               always be generated (see comment for previous
                                                               fence).
                                                             - Could be split into separate s_waitcnt vmcnt(0)
                                                               and s_waitcnt lgkmcnt(0) to allow them to be
                                                               independently moved according to the following
                                                               rules.
                                                             - s_waitcnt vmcnt(0) must happen after any
                                                               preceding global/generic load/store/load
                                                               atomic/store atomic/atomicrmw.
                                                             - s_waitcnt lgkmcnt(0) must happen after any
                                                               preceding local/generic load/store/load
                                                               atomic/store atomic/atomicrmw.
                                                             - Must happen before the following buffer_invl2
                                                               and buffer_wbinvl1_vol.
                                                             - Ensures that the preceding
                                                               global/local/generic load atomic/atomicrmw
                                                               with an equal or wider sync scope and memory
                                                               ordering stronger than unordered (this is
                                                               termed the acquire-fence-paired-atomic) has
                                                               completed before invalidating the cache. This
                                                               satisfies the requirements of acquire.
                                                             - Ensures that all previous memory operations
                                                               have completed before a following
                                                               global/local/generic store atomic/atomicrmw
                                                               with an equal or wider sync scope and memory
                                                               ordering stronger than unordered (this is
                                                               termed the release-fence-paired-atomic).
                                                               This satisfies the requirements of release.

                                                          3. buffer_invl2;
                                                             buffer_wbinvl1_vol

                                                             - Must happen before any following
                                                               global/generic load/load atomic/store/store
                                                               atomic/atomicrmw.
                                                             - Ensures that following loads will not see
                                                               stale L1 global data, nor see stale L2 MTYPE
                                                               NC global data. MTYPE RW and CC memory will
                                                               never be stale in L2 due to the memory
                                                               probes.

     **Sequential Consistent Atomic**
     ------------------------------------------------------------------------------------
     load atomic  seq_cst      - singlethread - global   *Same as corresponding load atomic acquire, except
                               - wavefront    - local    must generate all instructions even for OpenCL.*
                                              - generic
     load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
                                              - generic
                                                             - Use lgkmcnt(0) if not TgSplit execution mode
                                                               and vmcnt(0) if TgSplit execution mode.
                                                             - s_waitcnt lgkmcnt(0) must happen after
                                                               preceding local/generic load atomic/store
                                                               atomic/atomicrmw with memory ordering of
                                                               seq_cst and with equal or wider sync scope.
                                                               (Note that seq_cst fences have their own
                                                               s_waitcnt lgkmcnt(0) and so do not need to be
                                                               considered.)
                                                             - s_waitcnt vmcnt(0) must happen after
                                                               preceding global/generic load atomic/store
                                                               atomic/atomicrmw with memory ordering of
                                                               seq_cst and with equal or wider sync scope.
                                                               (Note that seq_cst fences have their own
                                                               s_waitcnt vmcnt(0) and so do not need to be
                                                               considered.)
                                                             - Ensures any preceding sequential consistent
                                                               global/local memory instructions have
                                                               completed before executing this sequentially
                                                               consistent instruction. This prevents
                                                               reordering a seq_cst store followed by a
                                                               seq_cst load. (Note that seq_cst is stronger
                                                               than acquire/release as the reordering of
                                                               load acquire followed by a store release is
                                                               prevented by the s_waitcnt of the release,
                                                               but there is nothing preventing a store
                                                               release followed by load acquire from
                                                               completing out of order. The s_waitcnt could
                                                               be placed after seq_store or before the
                                                               seq_load. We choose the load to make the
                                                               s_waitcnt be as late as possible so that the
                                                               store may have already completed.)

                                                          2. *Following instructions same as corresponding
                                                             load atomic acquire, except must generate all
                                                             instructions even for OpenCL.*

     load atomic  seq_cst      - workgroup    - local    *If TgSplit execution mode, local address space
                                                         cannot be used.*

                                                         *Same as corresponding load atomic acquire, except
                                                         must generate all instructions even for OpenCL.*

     load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) & vmcnt(0)
                               - system       - generic
                                                             - If TgSplit execution mode, omit lgkmcnt(0).
                                                             - Could be split into separate s_waitcnt vmcnt(0)
                                                               and s_waitcnt lgkmcnt(0) to allow them to be
                                                               independently moved according to the following
                                                               rules.
                                                             - s_waitcnt lgkmcnt(0) must happen after
                                                               preceding global/generic load atomic/store
                                                               atomic/atomicrmw with memory ordering of
                                                               seq_cst and with equal or wider sync scope.
                                                               (Note that seq_cst fences have their own
                                                               s_waitcnt lgkmcnt(0) and so do not need to be
                                                               considered.)
                                                             - s_waitcnt vmcnt(0) must happen after
                                                               preceding global/generic load atomic/store
                                                               atomic/atomicrmw with memory ordering of
                                                               seq_cst and with equal or wider sync scope.
                                                               (Note that seq_cst fences have their own
                                                               s_waitcnt vmcnt(0) and so do not need to be
                                                               considered.)
                                                             - Ensures any preceding sequential consistent
                                                               global memory instructions have completed
                                                               before executing this sequentially consistent
                                                               instruction. This prevents reordering a
                                                               seq_cst store followed by a seq_cst load.
                                                               (Note that seq_cst is stronger than
                                                               acquire/release as the reordering of load
                                                               acquire followed by a store release is
                                                               prevented by the s_waitcnt of the release,
                                                               but there is nothing preventing a store
                                                               release followed by load acquire from
                                                               completing out of order. The s_waitcnt could
                                                               be placed after seq_store or before the
                                                               seq_load. We choose the load to make the
                                                               s_waitcnt be as late as possible so that the
                                                               store may have already completed.)

                                                          2. *Following instructions same as corresponding
                                                             load atomic acquire, except must generate all
                                                             instructions even for OpenCL.*

     store atomic seq_cst      - singlethread - global   *Same as corresponding store atomic release,
                               - wavefront    - local    except must generate all instructions even for
                               - workgroup    - generic  OpenCL.*
                               - agent
                               - system
     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding atomicrmw acq_rel, except
                               - wavefront    - local    must generate all instructions even for OpenCL.*
                               - workgroup    - generic
                               - agent
                               - system
     fence        seq_cst      - singlethread *none*     *Same as corresponding fence acq_rel, except must
                               - wavefront               generate all instructions even for OpenCL.*
                               - workgroup
                               - agent
                               - system
     ============ ============ ============== ========== ================================

.. _amdgpu-amdhsa-memory-model-gfx940:

Memory Model GFX940
+++++++++++++++++++

For GFX940:

* Each agent has multiple shader arrays (SA).
* Each SA has multiple compute units (CU).
* Each CU has multiple SIMDs that execute wavefronts.
* The wavefronts for a single work-group are executed in the same CU but may be
  executed by different SIMDs.
The exception is when in tgsplit execution mode 8768 when the wavefronts may be executed by different SIMDs in different CUs. 8769* Each CU has a single LDS memory shared by the wavefronts of the work-groups 8770 executing on it. The exception is when in tgsplit execution mode when no LDS 8771 is allocated as wavefronts of the same work-group can be in different CUs. 8772* All LDS operations of a CU are performed as wavefront wide operations in a 8773 global order and involve no caching. Completion is reported to a wavefront in 8774 execution order. 8775* The LDS memory has multiple request queues shared by the SIMDs of a 8776 CU. Therefore, the LDS operations performed by different wavefronts of a 8777 work-group can be reordered relative to each other, which can result in 8778 reordering the visibility of vector memory operations with respect to LDS 8779 operations of other wavefronts in the same work-group. A ``s_waitcnt 8780 lgkmcnt(0)`` is required to ensure synchronization between LDS operations and 8781 vector memory operations between wavefronts of a work-group, but not between 8782 operations performed by the same wavefront. 8783* The vector memory operations are performed as wavefront wide operations and 8784 completion is reported to a wavefront in execution order. The exception is 8785 that ``flat_load/store/atomic`` instructions can report out of vector memory 8786 order if they access LDS memory, and out of LDS operation order if they access 8787 global memory. 8788* The vector memory operations access a single vector L1 cache shared by all 8789 SIMDs a CU. Therefore: 8790 8791 * No special action is required for coherence between the lanes of a single 8792 wavefront. 8793 8794 * No special action is required for coherence between wavefronts in the same 8795 work-group since they execute on the same CU. 
The exception is when in 8796 tgsplit execution mode as wavefronts of the same work-group can be in 8797 different CUs and so a ``buffer_inv sc0`` is required which will invalidate 8798 the L1 cache. 8799 8800 * A ``buffer_inv sc0`` is required to invalidate the L1 cache for coherence 8801 between wavefronts executing in different work-groups as they may be 8802 executing on different CUs. 8803 8804 * Atomic read-modify-write instructions implicitly bypass the L1 cache. 8805 Therefore, they do not use the sc0 bit for coherence and instead use it to 8806 indicate if the instruction returns the original value being updated. They 8807 do use sc1 to indicate system or agent scope coherence. 8808 8809* The scalar memory operations access a scalar L1 cache shared by all wavefronts 8810 on a group of CUs. The scalar and vector L1 caches are not coherent. However, 8811 scalar operations are used in a restricted way so do not impact the memory 8812 model. See :ref:`amdgpu-amdhsa-memory-spaces`. 8813* The vector and scalar memory operations use an L2 cache. 8814 8815 * The gfx940 can be configured as a number of smaller agents with each having 8816 a single L2 shared by all CUs on the same agent, or as fewer (possibly one) 8817 larger agents with groups of CUs on each agent each sharing separate L2 8818 caches. 8819 * The L2 cache has independent channels to service disjoint ranges of virtual 8820 addresses. 8821 * Each CU has a separate request queue per channel for its associated L2. 8822 Therefore, the vector and scalar memory operations performed by wavefronts 8823 executing with different L1 caches and the same L2 cache can be reordered 8824 relative to each other. 8825 * A ``s_waitcnt vmcnt(0)`` is required to ensure synchronization between 8826 vector memory operations of different CUs. 
    It ensures a previous vector memory operation has completed before executing
    a subsequent vector memory or LDS operation and so can be used to meet the
    requirements of acquire and release.
  * An L2 cache can be kept coherent with other L2 caches by using the MTYPE RW
    (read-write) for memory local to the L2, and MTYPE NC (non-coherent) with
    the PTE C-bit set for memory not local to the L2.

    * Any local memory cache lines will be automatically invalidated by writes
      from CUs associated with other L2 caches, or writes from the CPU, due to
      the cache probe caused by the PTE C-bit.
    * XGMI accesses from the CPU to local memory may be cached on the CPU.
      Subsequent access from the GPU will automatically invalidate or writeback
      the CPU cache due to the L2 probe filter.
    * To ensure coherence of local memory writes of CUs with different L1 caches
      in the same agent a ``buffer_wbl2`` is required. It does nothing if the
      agent is configured to have a single L2, or will writeback dirty L2 cache
      lines if configured to have multiple L2 caches.
    * To ensure coherence of local memory writes of CUs in different agents a
      ``buffer_wbl2 sc1`` is required. It will writeback dirty L2 cache lines.
    * To ensure coherence of local memory reads of CUs with different L1 caches
      in the same agent a ``buffer_inv sc1`` is required. It does nothing if the
      agent is configured to have a single L2, or will invalidate non-local L2
      cache lines if configured to have multiple L2 caches.
    * To ensure coherence of local memory reads of CUs in different agents a
      ``buffer_inv sc0 sc1`` is required. It will invalidate non-local L2 cache
      lines if configured to have multiple L2 caches.

  * PCIe access from the GPU to the CPU can be kept coherent by using the MTYPE
    UC (uncached) which bypasses the L2.
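
These cache operations exist to implement the LLVM acquire and release
orderings in software: at agent scope a release store is preceded by a
``buffer_wbl2 sc1`` and a ``s_waitcnt``, and an acquire load is followed by a
``s_waitcnt`` and a ``buffer_inv sc1``, as detailed in the code sequences table
below. As a minimal host-side C++ sketch of the message-passing idiom these
sequences implement (ordinary CPU code shown for illustration only; it does not
run on the GPU, and the function names are illustrative, not part of the
backend):

.. code-block:: c++

  #include <atomic>
  #include <cassert>
  #include <thread>

  std::atomic<int> flag{0};
  int payload = 0;  // plain (non-atomic) data protected by the flag

  void producer() {
    payload = 42;                              // ordinary store
    flag.store(1, std::memory_order_release);  // "store atomic release"
  }

  void consumer() {
    while (flag.load(std::memory_order_acquire) != 1) {}  // "load atomic acquire"
    // The release/acquire pairing guarantees the payload store is visible.
    assert(payload == 42);
  }

  int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
    return 0;
  }

On a CPU the pairing above is honored by hardware cache coherence; on gfx940
the ``buffer_wbl2`` and ``buffer_inv`` instructions take that role, writing
back released stores and discarding stale lines from the caches of other CUs
or agents.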

Scalar memory operations are only used to access memory that is proven to not
change during the execution of the kernel dispatch. This includes constant
address space and global address space for program scope ``const`` variables.
Therefore, the kernel machine code does not have to maintain the scalar cache to
ensure it is coherent with the vector caches. The scalar and vector caches are
invalidated between kernel dispatches by CP since constant address space data
may change between kernel dispatch executions. See
:ref:`amdgpu-amdhsa-memory-spaces`.

The one exception is if scalar writes are used to spill SGPR registers. In this
case the AMDGPU backend ensures the memory location used to spill is never
accessed by vector memory operations at the same time. If scalar writes are used
then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
return since the locations may be used for vector memory instructions by a
future wavefront that uses the same scratch area, or a function call that
creates a frame at the same address, respectively. There is no need for a
``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.

For kernarg backing memory:

* CP invalidates the L1 cache at the start of each kernel dispatch.
* On dGPU over XGMI or PCIe the kernarg backing memory is allocated in host
  memory accessed as MTYPE UC (uncached) to avoid needing to invalidate the L2
  cache. This also causes it to be treated as non-volatile and so is not
  invalidated by ``*_vol``.
* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
  so the L2 cache will be coherent with the CPU and other agents.

Scratch backing memory (which is used for the private address space) is accessed
with MTYPE NC_NV (non-coherent non-volatile).
Since the private address space is 8887only accessed by a single thread, and is always write-before-read, there is 8888never a need to invalidate these entries from the L1 cache. Hence all cache 8889invalidates are done as ``*_vol`` to only invalidate the volatile cache lines. 8890 8891The code sequences used to implement the memory model for GFX940 are defined 8892in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx940-table`. 8893 8894 .. table:: AMDHSA Memory Model Code Sequences GFX940 8895 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx940-table 8896 8897 ============ ============ ============== ========== ================================ 8898 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code 8899 Ordering Sync Scope Address GFX940 8900 Space 8901 ============ ============ ============== ========== ================================ 8902 **Non-Atomic** 8903 ------------------------------------------------------------------------------------ 8904 load *none* *none* - global - !volatile & !nontemporal 8905 - generic 8906 - private 1. buffer/global/flat_load 8907 - constant 8908 - !volatile & nontemporal 8909 8910 1. buffer/global/flat_load 8911 nt=1 8912 8913 - volatile 8914 8915 1. buffer/global/flat_load 8916 sc0=1 sc1=1 8917 2. s_waitcnt vmcnt(0) 8918 8919 - Must happen before 8920 any following volatile 8921 global/generic 8922 load/store. 8923 - Ensures that 8924 volatile 8925 operations to 8926 different 8927 addresses will not 8928 be reordered by 8929 hardware. 8930 8931 load *none* *none* - local 1. ds_load 8932 store *none* *none* - global - !volatile & !nontemporal 8933 - generic 8934 - private 1. buffer/global/flat_store 8935 - constant 8936 - !volatile & nontemporal 8937 8938 1. buffer/global/flat_store 8939 nt=1 8940 8941 - volatile 8942 8943 1. buffer/global/flat_store 8944 sc0=1 sc1=1 8945 2. s_waitcnt vmcnt(0) 8946 8947 - Must happen before 8948 any following volatile 8949 global/generic 8950 load/store. 
8951 - Ensures that 8952 volatile 8953 operations to 8954 different 8955 addresses will not 8956 be reordered by 8957 hardware. 8958 8959 store *none* *none* - local 1. ds_store 8960 **Unordered Atomic** 8961 ------------------------------------------------------------------------------------ 8962 load atomic unordered *any* *any* *Same as non-atomic*. 8963 store atomic unordered *any* *any* *Same as non-atomic*. 8964 atomicrmw unordered *any* *any* *Same as monotonic atomic*. 8965 **Monotonic Atomic** 8966 ------------------------------------------------------------------------------------ 8967 load atomic monotonic - singlethread - global 1. buffer/global/flat_load 8968 - wavefront - generic 8969 load atomic monotonic - workgroup - global 1. buffer/global/flat_load 8970 - generic sc0=1 8971 load atomic monotonic - singlethread - local *If TgSplit execution mode, 8972 - wavefront local address space cannot 8973 - workgroup be used.* 8974 8975 1. ds_load 8976 load atomic monotonic - agent - global 1. buffer/global/flat_load 8977 - generic sc1=1 8978 load atomic monotonic - system - global 1. buffer/global/flat_load 8979 - generic sc0=1 sc1=1 8980 store atomic monotonic - singlethread - global 1. buffer/global/flat_store 8981 - wavefront - generic 8982 store atomic monotonic - workgroup - global 1. buffer/global/flat_store 8983 - generic sc0=1 8984 store atomic monotonic - agent - global 1. buffer/global/flat_store 8985 - generic sc1=1 8986 store atomic monotonic - system - global 1. buffer/global/flat_store 8987 - generic sc0=1 sc1=1 8988 store atomic monotonic - singlethread - local *If TgSplit execution mode, 8989 - wavefront local address space cannot 8990 - workgroup be used.* 8991 8992 1. ds_store 8993 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic 8994 - wavefront - generic 8995 - workgroup 8996 - agent 8997 atomicrmw monotonic - system - global 1. 
buffer/global/flat_atomic 8998 - generic sc1=1 8999 atomicrmw monotonic - singlethread - local *If TgSplit execution mode, 9000 - wavefront local address space cannot 9001 - workgroup be used.* 9002 9003 1. ds_atomic 9004 **Acquire Atomic** 9005 ------------------------------------------------------------------------------------ 9006 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load 9007 - wavefront - local 9008 - generic 9009 load atomic acquire - workgroup - global 1. buffer/global_load sc0=1 9010 2. s_waitcnt vmcnt(0) 9011 9012 - If not TgSplit execution 9013 mode, omit. 9014 - Must happen before the 9015 following buffer_inv. 9016 9017 3. buffer_inv sc0=1 9018 9019 - If not TgSplit execution 9020 mode, omit. 9021 - Must happen before 9022 any following 9023 global/generic 9024 load/load 9025 atomic/store/store 9026 atomic/atomicrmw. 9027 - Ensures that 9028 following 9029 loads will not see 9030 stale data. 9031 9032 load atomic acquire - workgroup - local *If TgSplit execution mode, 9033 local address space cannot 9034 be used.* 9035 9036 1. ds_load 9037 2. s_waitcnt lgkmcnt(0) 9038 9039 - If OpenCL, omit. 9040 - Must happen before 9041 any following 9042 global/generic 9043 load/load 9044 atomic/store/store 9045 atomic/atomicrmw. 9046 - Ensures any 9047 following global 9048 data read is no 9049 older than the local load 9050 atomic value being 9051 acquired. 9052 9053 load atomic acquire - workgroup - generic 1. flat_load sc0=1 9054 2. s_waitcnt lgkm/vmcnt(0) 9055 9056 - Use lgkmcnt(0) if not 9057 TgSplit execution mode 9058 and vmcnt(0) if TgSplit 9059 execution mode. 9060 - If OpenCL, omit lgkmcnt(0). 9061 - Must happen before 9062 the following 9063 buffer_inv and any 9064 following global/generic 9065 load/load 9066 atomic/store/store 9067 atomic/atomicrmw. 9068 - Ensures any 9069 following global 9070 data read is no 9071 older than a local load 9072 atomic value being 9073 acquired. 9074 9075 3. 
buffer_inv sc0=1 9076 9077 - If not TgSplit execution 9078 mode, omit. 9079 - Ensures that 9080 following 9081 loads will not see 9082 stale data. 9083 9084 load atomic acquire - agent - global 1. buffer/global_load 9085 sc1=1 9086 2. s_waitcnt vmcnt(0) 9087 9088 - Must happen before 9089 following 9090 buffer_inv. 9091 - Ensures the load 9092 has completed 9093 before invalidating 9094 the cache. 9095 9096 3. buffer_inv sc1=1 9097 9098 - Must happen before 9099 any following 9100 global/generic 9101 load/load 9102 atomic/atomicrmw. 9103 - Ensures that 9104 following 9105 loads will not see 9106 stale global data. 9107 9108 load atomic acquire - system - global 1. buffer/global/flat_load 9109 sc0=1 sc1=1 9110 2. s_waitcnt vmcnt(0) 9111 9112 - Must happen before 9113 following 9114 buffer_inv. 9115 - Ensures the load 9116 has completed 9117 before invalidating 9118 the cache. 9119 9120 3. buffer_inv sc0=1 sc1=1 9121 9122 - Must happen before 9123 any following 9124 global/generic 9125 load/load 9126 atomic/atomicrmw. 9127 - Ensures that 9128 following 9129 loads will not see 9130 stale MTYPE NC global data. 9131 MTYPE RW and CC memory will 9132 never be stale due to the 9133 memory probes. 9134 9135 load atomic acquire - agent - generic 1. flat_load sc1=1 9136 2. s_waitcnt vmcnt(0) & 9137 lgkmcnt(0) 9138 9139 - If TgSplit execution mode, 9140 omit lgkmcnt(0). 9141 - If OpenCL omit 9142 lgkmcnt(0). 9143 - Must happen before 9144 following 9145 buffer_inv. 9146 - Ensures the flat_load 9147 has completed 9148 before invalidating 9149 the cache. 9150 9151 3. buffer_inv sc1=1 9152 9153 - Must happen before 9154 any following 9155 global/generic 9156 load/load 9157 atomic/atomicrmw. 9158 - Ensures that 9159 following loads 9160 will not see stale 9161 global data. 9162 9163 load atomic acquire - system - generic 1. flat_load sc0=1 sc1=1 9164 2. s_waitcnt vmcnt(0) & 9165 lgkmcnt(0) 9166 9167 - If TgSplit execution mode, 9168 omit lgkmcnt(0). 
9169 - If OpenCL omit 9170 lgkmcnt(0). 9171 - Must happen before 9172 the following 9173 buffer_inv. 9174 - Ensures the flat_load 9175 has completed 9176 before invalidating 9177 the caches. 9178 9179 3. buffer_inv sc0=1 sc1=1 9180 9181 - Must happen before 9182 any following 9183 global/generic 9184 load/load 9185 atomic/atomicrmw. 9186 - Ensures that 9187 following 9188 loads will not see 9189 stale MTYPE NC global data. 9190 MTYPE RW and CC memory will 9191 never be stale due to the 9192 memory probes. 9193 9194 atomicrmw acquire - singlethread - global 1. buffer/global/flat_atomic 9195 - wavefront - generic 9196 atomicrmw acquire - singlethread - local *If TgSplit execution mode, 9197 - wavefront local address space cannot 9198 be used.* 9199 9200 1. ds_atomic 9201 atomicrmw acquire - workgroup - global 1. buffer/global_atomic 9202 2. s_waitcnt vmcnt(0) 9203 9204 - If not TgSplit execution 9205 mode, omit. 9206 - Must happen before the 9207 following buffer_inv. 9208 - Ensures the atomicrmw 9209 has completed 9210 before invalidating 9211 the cache. 9212 9213 3. buffer_inv sc0=1 9214 9215 - If not TgSplit execution 9216 mode, omit. 9217 - Must happen before 9218 any following 9219 global/generic 9220 load/load 9221 atomic/atomicrmw. 9222 - Ensures that 9223 following loads 9224 will not see stale 9225 global data. 9226 9227 atomicrmw acquire - workgroup - local *If TgSplit execution mode, 9228 local address space cannot 9229 be used.* 9230 9231 1. ds_atomic 9232 2. s_waitcnt lgkmcnt(0) 9233 9234 - If OpenCL, omit. 9235 - Must happen before 9236 any following 9237 global/generic 9238 load/load 9239 atomic/store/store 9240 atomic/atomicrmw. 9241 - Ensures any 9242 following global 9243 data read is no 9244 older than the local 9245 atomicrmw value 9246 being acquired. 9247 9248 atomicrmw acquire - workgroup - generic 1. flat_atomic 9249 2. 
s_waitcnt lgkm/vmcnt(0) 9250 9251 - Use lgkmcnt(0) if not 9252 TgSplit execution mode 9253 and vmcnt(0) if TgSplit 9254 execution mode. 9255 - If OpenCL, omit lgkmcnt(0). 9256 - Must happen before 9257 the following 9258 buffer_inv and 9259 any following 9260 global/generic 9261 load/load 9262 atomic/store/store 9263 atomic/atomicrmw. 9264 - Ensures any 9265 following global 9266 data read is no 9267 older than a local 9268 atomicrmw value 9269 being acquired. 9270 9271 3. buffer_inv sc0=1 9272 9273 - If not TgSplit execution 9274 mode, omit. 9275 - Ensures that 9276 following 9277 loads will not see 9278 stale data. 9279 9280 atomicrmw acquire - agent - global 1. buffer/global_atomic 9281 2. s_waitcnt vmcnt(0) 9282 9283 - Must happen before 9284 following 9285 buffer_inv. 9286 - Ensures the 9287 atomicrmw has 9288 completed before 9289 invalidating the 9290 cache. 9291 9292 3. buffer_inv sc1=1 9293 9294 - Must happen before 9295 any following 9296 global/generic 9297 load/load 9298 atomic/atomicrmw. 9299 - Ensures that 9300 following loads 9301 will not see stale 9302 global data. 9303 9304 atomicrmw acquire - system - global 1. buffer/global_atomic 9305 sc1=1 9306 2. s_waitcnt vmcnt(0) 9307 9308 - Must happen before 9309 following 9310 buffer_inv. 9311 - Ensures the 9312 atomicrmw has 9313 completed before 9314 invalidating the 9315 caches. 9316 9317 3. buffer_inv sc0=1 sc1=1 9318 9319 - Must happen before 9320 any following 9321 global/generic 9322 load/load 9323 atomic/atomicrmw. 9324 - Ensures that 9325 following 9326 loads will not see 9327 stale MTYPE NC global data. 9328 MTYPE RW and CC memory will 9329 never be stale due to the 9330 memory probes. 9331 9332 atomicrmw acquire - agent - generic 1. flat_atomic 9333 2. s_waitcnt vmcnt(0) & 9334 lgkmcnt(0) 9335 9336 - If TgSplit execution mode, 9337 omit lgkmcnt(0). 9338 - If OpenCL, omit 9339 lgkmcnt(0). 9340 - Must happen before 9341 following 9342 buffer_inv. 
9343 - Ensures the 9344 atomicrmw has 9345 completed before 9346 invalidating the 9347 cache. 9348 9349 3. buffer_inv sc1=1 9350 9351 - Must happen before 9352 any following 9353 global/generic 9354 load/load 9355 atomic/atomicrmw. 9356 - Ensures that 9357 following loads 9358 will not see stale 9359 global data. 9360 9361 atomicrmw acquire - system - generic 1. flat_atomic sc1=1 9362 2. s_waitcnt vmcnt(0) & 9363 lgkmcnt(0) 9364 9365 - If TgSplit execution mode, 9366 omit lgkmcnt(0). 9367 - If OpenCL, omit 9368 lgkmcnt(0). 9369 - Must happen before 9370 following 9371 buffer_inv. 9372 - Ensures the 9373 atomicrmw has 9374 completed before 9375 invalidating the 9376 caches. 9377 9378 3. buffer_inv sc0=1 sc1=1 9379 9380 - Must happen before 9381 any following 9382 global/generic 9383 load/load 9384 atomic/atomicrmw. 9385 - Ensures that 9386 following 9387 loads will not see 9388 stale MTYPE NC global data. 9389 MTYPE RW and CC memory will 9390 never be stale due to the 9391 memory probes. 9392 9393 fence acquire - singlethread *none* *none* 9394 - wavefront 9395 fence acquire - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0) 9396 9397 - Use lgkmcnt(0) if not 9398 TgSplit execution mode 9399 and vmcnt(0) if TgSplit 9400 execution mode. 9401 - If OpenCL and 9402 address space is 9403 not generic, omit 9404 lgkmcnt(0). 9405 - If OpenCL and 9406 address space is 9407 local, omit 9408 vmcnt(0). 9409 - However, since LLVM 9410 currently has no 9411 address space on 9412 the fence need to 9413 conservatively 9414 always generate. If 9415 fence had an 9416 address space then 9417 set to address 9418 space of OpenCL 9419 fence flag, or to 9420 generic if both 9421 local and global 9422 flags are 9423 specified. 
9424 - s_waitcnt vmcnt(0) 9425 must happen after 9426 any preceding 9427 global/generic load 9428 atomic/ 9429 atomicrmw 9430 with an equal or 9431 wider sync scope 9432 and memory ordering 9433 stronger than 9434 unordered (this is 9435 termed the 9436 fence-paired-atomic). 9437 - s_waitcnt lgkmcnt(0) 9438 must happen after 9439 any preceding 9440 local/generic load 9441 atomic/atomicrmw 9442 with an equal or 9443 wider sync scope 9444 and memory ordering 9445 stronger than 9446 unordered (this is 9447 termed the 9448 fence-paired-atomic). 9449 - Must happen before 9450 the following 9451 buffer_inv and 9452 any following 9453 global/generic 9454 load/load 9455 atomic/store/store 9456 atomic/atomicrmw. 9457 - Ensures any 9458 following global 9459 data read is no 9460 older than the 9461 value read by the 9462 fence-paired-atomic. 9463 9464 3. buffer_inv sc0=1 9465 9466 - If not TgSplit execution 9467 mode, omit. 9468 - Ensures that 9469 following 9470 loads will not see 9471 stale data. 9472 9473 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) & 9474 vmcnt(0) 9475 9476 - If TgSplit execution mode, 9477 omit lgkmcnt(0). 9478 - If OpenCL and 9479 address space is 9480 not generic, omit 9481 lgkmcnt(0). 9482 - However, since LLVM 9483 currently has no 9484 address space on 9485 the fence need to 9486 conservatively 9487 always generate 9488 (see comment for 9489 previous fence). 9490 - Could be split into 9491 separate s_waitcnt 9492 vmcnt(0) and 9493 s_waitcnt 9494 lgkmcnt(0) to allow 9495 them to be 9496 independently moved 9497 according to the 9498 following rules. 9499 - s_waitcnt vmcnt(0) 9500 must happen after 9501 any preceding 9502 global/generic load 9503 atomic/atomicrmw 9504 with an equal or 9505 wider sync scope 9506 and memory ordering 9507 stronger than 9508 unordered (this is 9509 termed the 9510 fence-paired-atomic). 
9511 - s_waitcnt lgkmcnt(0) 9512 must happen after 9513 any preceding 9514 local/generic load 9515 atomic/atomicrmw 9516 with an equal or 9517 wider sync scope 9518 and memory ordering 9519 stronger than 9520 unordered (this is 9521 termed the 9522 fence-paired-atomic). 9523 - Must happen before 9524 the following 9525 buffer_inv. 9526 - Ensures that the 9527 fence-paired atomic 9528 has completed 9529 before invalidating 9530 the 9531 cache. Therefore 9532 any following 9533 locations read must 9534 be no older than 9535 the value read by 9536 the 9537 fence-paired-atomic. 9538 9539 2. buffer_inv sc1=1 9540 9541 - Must happen before any 9542 following global/generic 9543 load/load 9544 atomic/store/store 9545 atomic/atomicrmw. 9546 - Ensures that 9547 following loads 9548 will not see stale 9549 global data. 9550 9551 fence acquire - system *none* 1. s_waitcnt lgkmcnt(0) & 9552 vmcnt(0) 9553 9554 - If TgSplit execution mode, 9555 omit lgkmcnt(0). 9556 - If OpenCL and 9557 address space is 9558 not generic, omit 9559 lgkmcnt(0). 9560 - However, since LLVM 9561 currently has no 9562 address space on 9563 the fence need to 9564 conservatively 9565 always generate 9566 (see comment for 9567 previous fence). 9568 - Could be split into 9569 separate s_waitcnt 9570 vmcnt(0) and 9571 s_waitcnt 9572 lgkmcnt(0) to allow 9573 them to be 9574 independently moved 9575 according to the 9576 following rules. 9577 - s_waitcnt vmcnt(0) 9578 must happen after 9579 any preceding 9580 global/generic load 9581 atomic/atomicrmw 9582 with an equal or 9583 wider sync scope 9584 and memory ordering 9585 stronger than 9586 unordered (this is 9587 termed the 9588 fence-paired-atomic). 9589 - s_waitcnt lgkmcnt(0) 9590 must happen after 9591 any preceding 9592 local/generic load 9593 atomic/atomicrmw 9594 with an equal or 9595 wider sync scope 9596 and memory ordering 9597 stronger than 9598 unordered (this is 9599 termed the 9600 fence-paired-atomic). 
9601 - Must happen before 9602 the following 9603 buffer_inv. 9604 - Ensures that the 9605 fence-paired atomic 9606 has completed 9607 before invalidating 9608 the 9609 cache. Therefore 9610 any following 9611 locations read must 9612 be no older than 9613 the value read by 9614 the 9615 fence-paired-atomic. 9616 9617 2. buffer_inv sc0=1 sc1=1 9618 9619 - Must happen before any 9620 following global/generic 9621 load/load 9622 atomic/store/store 9623 atomic/atomicrmw. 9624 - Ensures that 9625 following loads 9626 will not see stale 9627 global data. 9628 9629 **Release Atomic** 9630 ------------------------------------------------------------------------------------ 9631 store atomic release - singlethread - global 1. buffer/global/flat_store 9632 - wavefront - generic 9633 store atomic release - singlethread - local *If TgSplit execution mode, 9634 - wavefront local address space cannot 9635 be used.* 9636 9637 1. ds_store 9638 store atomic release - workgroup - global 1. s_waitcnt lgkm/vmcnt(0) 9639 - generic 9640 - Use lgkmcnt(0) if not 9641 TgSplit execution mode 9642 and vmcnt(0) if TgSplit 9643 execution mode. 9644 - If OpenCL, omit lgkmcnt(0). 9645 - s_waitcnt vmcnt(0) 9646 must happen after 9647 any preceding 9648 global/generic load/store/ 9649 load atomic/store atomic/ 9650 atomicrmw. 9651 - s_waitcnt lgkmcnt(0) 9652 must happen after 9653 any preceding 9654 local/generic 9655 load/store/load 9656 atomic/store 9657 atomic/atomicrmw. 9658 - Must happen before 9659 the following 9660 store. 9661 - Ensures that all 9662 memory operations 9663 have 9664 completed before 9665 performing the 9666 store that is being 9667 released. 9668 9669 2. buffer/global/flat_store sc0=1 9670 store atomic release - workgroup - local *If TgSplit execution mode, 9671 local address space cannot 9672 be used.* 9673 9674 1. ds_store 9675 store atomic release - agent - global 1. buffer_wbl2 sc1=1 9676 - generic 9677 - Must happen before 9678 following s_waitcnt. 
9679 - Performs L2 writeback to 9680 ensure previous 9681 global/generic 9682 store/atomicrmw are 9683 visible at agent scope. 9684 9685 2. s_waitcnt lgkmcnt(0) & 9686 vmcnt(0) 9687 9688 - If TgSplit execution mode, 9689 omit lgkmcnt(0). 9690 - If OpenCL and 9691 address space is 9692 not generic, omit 9693 lgkmcnt(0). 9694 - Could be split into 9695 separate s_waitcnt 9696 vmcnt(0) and 9697 s_waitcnt 9698 lgkmcnt(0) to allow 9699 them to be 9700 independently moved 9701 according to the 9702 following rules. 9703 - s_waitcnt vmcnt(0) 9704 must happen after 9705 any preceding 9706 global/generic 9707 load/store/load 9708 atomic/store 9709 atomic/atomicrmw. 9710 - s_waitcnt lgkmcnt(0) 9711 must happen after 9712 any preceding 9713 local/generic 9714 load/store/load 9715 atomic/store 9716 atomic/atomicrmw. 9717 - Must happen before 9718 the following 9719 store. 9720 - Ensures that all 9721 memory operations 9722 to memory have 9723 completed before 9724 performing the 9725 store that is being 9726 released. 9727 9728 3. buffer/global/flat_store sc1=1 9729 store atomic release - system - global 1. buffer_wbl2 sc0=1 sc1=1 9730 - generic 9731 - Must happen before 9732 following s_waitcnt. 9733 - Performs L2 writeback to 9734 ensure previous 9735 global/generic 9736 store/atomicrmw are 9737 visible at system scope. 9738 9739 2. s_waitcnt lgkmcnt(0) & 9740 vmcnt(0) 9741 9742 - If TgSplit execution mode, 9743 omit lgkmcnt(0). 9744 - If OpenCL and 9745 address space is 9746 not generic, omit 9747 lgkmcnt(0). 9748 - Could be split into 9749 separate s_waitcnt 9750 vmcnt(0) and 9751 s_waitcnt 9752 lgkmcnt(0) to allow 9753 them to be 9754 independently moved 9755 according to the 9756 following rules. 9757 - s_waitcnt vmcnt(0) 9758 must happen after any 9759 preceding 9760 global/generic 9761 load/store/load 9762 atomic/store 9763 atomic/atomicrmw. 
9764 - s_waitcnt lgkmcnt(0) 9765 must happen after any 9766 preceding 9767 local/generic 9768 load/store/load 9769 atomic/store 9770 atomic/atomicrmw. 9771 - Must happen before 9772 the following 9773 store. 9774 - Ensures that all 9775 memory operations 9776 to memory and the L2 9777 writeback have 9778 completed before 9779 performing the 9780 store that is being 9781 released. 9782 9783 3. buffer/global/flat_store 9784 sc0=1 sc1=1 9785 atomicrmw release - singlethread - global 1. buffer/global/flat_atomic 9786 - wavefront - generic 9787 atomicrmw release - singlethread - local *If TgSplit execution mode, 9788 - wavefront local address space cannot 9789 be used.* 9790 9791 1. ds_atomic 9792 atomicrmw release - workgroup - global 1. s_waitcnt lgkm/vmcnt(0) 9793 - generic 9794 - Use lgkmcnt(0) if not 9795 TgSplit execution mode 9796 and vmcnt(0) if TgSplit 9797 execution mode. 9798 - If OpenCL, omit 9799 lgkmcnt(0). 9800 - s_waitcnt vmcnt(0) 9801 must happen after 9802 any preceding 9803 global/generic load/store/ 9804 load atomic/store atomic/ 9805 atomicrmw. 9806 - s_waitcnt lgkmcnt(0) 9807 must happen after 9808 any preceding 9809 local/generic 9810 load/store/load 9811 atomic/store 9812 atomic/atomicrmw. 9813 - Must happen before 9814 the following 9815 atomicrmw. 9816 - Ensures that all 9817 memory operations 9818 have 9819 completed before 9820 performing the 9821 atomicrmw that is 9822 being released. 9823 9824 2. buffer/global/flat_atomic sc0=1 9825 atomicrmw release - workgroup - local *If TgSplit execution mode, 9826 local address space cannot 9827 be used.* 9828 9829 1. ds_atomic 9830 atomicrmw release - agent - global 1. buffer_wbl2 sc1=1 9831 - generic 9832 - Must happen before 9833 following s_waitcnt. 9834 - Performs L2 writeback to 9835 ensure previous 9836 global/generic 9837 store/atomicrmw are 9838 visible at agent scope. 9839 9840 2. s_waitcnt lgkmcnt(0) & 9841 vmcnt(0) 9842 9843 - If TgSplit execution mode, 9844 omit lgkmcnt(0). 
9845 - If OpenCL, omit 9846 lgkmcnt(0). 9847 - Could be split into 9848 separate s_waitcnt 9849 vmcnt(0) and 9850 s_waitcnt 9851 lgkmcnt(0) to allow 9852 them to be 9853 independently moved 9854 according to the 9855 following rules. 9856 - s_waitcnt vmcnt(0) 9857 must happen after 9858 any preceding 9859 global/generic 9860 load/store/load 9861 atomic/store 9862 atomic/atomicrmw. 9863 - s_waitcnt lgkmcnt(0) 9864 must happen after 9865 any preceding 9866 local/generic 9867 load/store/load 9868 atomic/store 9869 atomic/atomicrmw. 9870 - Must happen before 9871 the following 9872 atomicrmw. 9873 - Ensures that all 9874 memory operations 9875 to global and local 9876 have completed 9877 before performing 9878 the atomicrmw that 9879 is being released. 9880 9881 3. buffer/global/flat_atomic sc1=1 9882 atomicrmw release - system - global 1. buffer_wbl2 sc0=1 sc1=1 9883 - generic 9884 - Must happen before 9885 following s_waitcnt. 9886 - Performs L2 writeback to 9887 ensure previous 9888 global/generic 9889 store/atomicrmw are 9890 visible at system scope. 9891 9892 2. s_waitcnt lgkmcnt(0) & 9893 vmcnt(0) 9894 9895 - If TgSplit execution mode, 9896 omit lgkmcnt(0). 9897 - If OpenCL, omit 9898 lgkmcnt(0). 9899 - Could be split into 9900 separate s_waitcnt 9901 vmcnt(0) and 9902 s_waitcnt 9903 lgkmcnt(0) to allow 9904 them to be 9905 independently moved 9906 according to the 9907 following rules. 9908 - s_waitcnt vmcnt(0) 9909 must happen after 9910 any preceding 9911 global/generic 9912 load/store/load 9913 atomic/store 9914 atomic/atomicrmw. 9915 - s_waitcnt lgkmcnt(0) 9916 must happen after 9917 any preceding 9918 local/generic 9919 load/store/load 9920 atomic/store 9921 atomic/atomicrmw. 9922 - Must happen before 9923 the following 9924 atomicrmw. 9925 - Ensures that all 9926 memory operations 9927 to memory and the L2 9928 writeback have 9929 completed before 9930 performing the 9931 store that is being 9932 released. 9933 9934 3. 
                                                         buffer/global/flat_atomic
                                                            sc0=1 sc1=1

     fence        release      - singlethread *none*     *none*
                               - wavefront
     fence        release      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)

                                                            - Use lgkmcnt(0) if not TgSplit
                                                              execution mode and vmcnt(0) if
                                                              TgSplit execution mode.
                                                            - If OpenCL and address space is
                                                              not generic, omit lgkmcnt(0).
                                                            - If OpenCL and address space is
                                                              local, omit vmcnt(0).
                                                            - However, since LLVM currently
                                                              has no address space on the
                                                              fence need to conservatively
                                                              always generate. If fence had
                                                              an address space then set to
                                                              address space of OpenCL fence
                                                              flag, or to generic if both
                                                              local and global flags are
                                                              specified.
                                                            - s_waitcnt vmcnt(0) must happen
                                                              after any preceding
                                                              global/generic load/store/load
                                                              atomic/store atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0) must
                                                              happen after any preceding
                                                              local/generic load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Must happen before any
                                                              following store
                                                              atomic/atomicrmw with an equal
                                                              or wider sync scope and memory
                                                              ordering stronger than
                                                              unordered (this is termed the
                                                              fence-paired-atomic).
                                                            - Ensures that all memory
                                                              operations have completed
                                                              before performing the
                                                              following fence-paired-atomic.

     fence        release      - agent        *none*     1. buffer_wbl2 sc1=1

                                                            - If OpenCL and address space is
                                                              local, omit.
                                                            - Must happen before following
                                                              s_waitcnt.
                                                            - Performs L2 writeback to
                                                              ensure previous global/generic
                                                              store/atomicrmw are visible at
                                                              agent scope.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                            - If TgSplit execution mode,
                                                              omit lgkmcnt(0).
                                                            - If OpenCL and address space is
                                                              not generic, omit lgkmcnt(0).
                                                            - If OpenCL and address space is
                                                              local, omit vmcnt(0).
                                                            - However, since LLVM currently
                                                              has no address space on the
                                                              fence need to conservatively
                                                              always generate. If fence had
                                                              an address space then set to
                                                              address space of OpenCL fence
                                                              flag, or to generic if both
                                                              local and global flags are
                                                              specified.
                                                            - Could be split into separate
                                                              s_waitcnt vmcnt(0) and
                                                              s_waitcnt lgkmcnt(0) to allow
                                                              them to be independently moved
                                                              according to the following
                                                              rules.
                                                            - s_waitcnt vmcnt(0) must happen
                                                              after any preceding
                                                              global/generic load/store/load
                                                              atomic/store atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0) must
                                                              happen after any preceding
                                                              local/generic load/store/load
                                                              atomic/store atomic/atomicrmw.
                                                            - Must happen before any
                                                              following store
                                                              atomic/atomicrmw with an equal
                                                              or wider sync scope and memory
                                                              ordering stronger than
                                                              unordered (this is termed the
                                                              fence-paired-atomic).
                                                            - Ensures that all memory
                                                              operations have completed
                                                              before performing the
                                                              following fence-paired-atomic.

     fence        release      - system       *none*     1. buffer_wbl2 sc0=1 sc1=1

                                                            - Must happen before following
                                                              s_waitcnt.
                                                            - Performs L2 writeback to
                                                              ensure previous global/generic
                                                              store/atomicrmw are visible at
                                                              system scope.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                            - If TgSplit execution mode,
                                                              omit lgkmcnt(0).
                                                            - If OpenCL and address space is
                                                              not generic, omit lgkmcnt(0).
                                                            - If OpenCL and address space is
                                                              local, omit vmcnt(0).
                                                            - However, since LLVM currently
                                                              has no address space on the
                                                              fence need to conservatively
                                                              always generate. If fence had
                                                              an address space then set to
                                                              address space of OpenCL fence
                                                              flag, or to generic if both
                                                              local and global flags are
                                                              specified.
                                                            - Could be split into separate
                                                              s_waitcnt vmcnt(0) and
                                                              s_waitcnt lgkmcnt(0) to allow
                                                              them to be independently moved
                                                              according to the following
                                                              rules.
                                                            - s_waitcnt vmcnt(0) must happen
                                                              after any preceding
                                                              global/generic load/store/load
                                                              atomic/store atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0) must
                                                              happen after any preceding
                                                              local/generic load/store/load
                                                              atomic/store atomic/atomicrmw.
                                                            - Must happen before any
                                                              following store
                                                              atomic/atomicrmw with an equal
                                                              or wider sync scope and memory
                                                              ordering stronger than
                                                              unordered (this is termed the
                                                              fence-paired-atomic).
                                                            - Ensures that all memory
                                                              operations have completed
                                                              before performing the
                                                              following fence-paired-atomic.

     **Acquire-Release Atomic**
     ------------------------------------------------------------------------------------
     atomicrmw    acq_rel      - singlethread - global   1. buffer/global/flat_atomic
                               - wavefront    - generic
     atomicrmw    acq_rel      - singlethread - local    *If TgSplit execution mode,
                               - wavefront               local address space cannot
                                                         be used.*

                                                         1. ds_atomic
     atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)

                                                            - Use lgkmcnt(0) if not TgSplit
                                                              execution mode and vmcnt(0) if
                                                              TgSplit execution mode.
                                                            - If OpenCL, omit lgkmcnt(0).
                                                            - Must happen after any
                                                              preceding local/generic
                                                              load/store/load atomic/store
                                                              atomic/atomicrmw.
                                                            - s_waitcnt vmcnt(0) must happen
                                                              after any preceding
                                                              global/generic load/store/load
                                                              atomic/store atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0) must
                                                              happen after any preceding
                                                              local/generic load/store/load
                                                              atomic/store atomic/atomicrmw.
                                                            - Must happen before the
                                                              following atomicrmw.
                                                            - Ensures that all memory
                                                              operations have completed
                                                              before performing the
                                                              atomicrmw that is being
                                                              released.

                                                         2. buffer/global_atomic
                                                         3. s_waitcnt vmcnt(0)

                                                            - If not TgSplit execution
                                                              mode, omit.
                                                            - Must happen before the
                                                              following buffer_inv.
                                                            - Ensures any following global
                                                              data read is no older than the
                                                              atomicrmw value being
                                                              acquired.

                                                         4. buffer_inv sc0=1

                                                            - If not TgSplit execution
                                                              mode, omit.
                                                            - Ensures that following loads
                                                              will not see stale data.

     atomicrmw    acq_rel      - workgroup    - local    *If TgSplit execution mode,
                                                         local address space cannot
                                                         be used.*

                                                         1. ds_atomic
                                                         2. s_waitcnt lgkmcnt(0)

                                                            - If OpenCL, omit.
                                                            - Must happen before any
                                                              following global/generic
                                                              load/load atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures any following global
                                                              data read is no older than the
                                                              local load atomic value being
                                                              acquired.

     atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkm/vmcnt(0)

                                                            - Use lgkmcnt(0) if not TgSplit
                                                              execution mode and vmcnt(0) if
                                                              TgSplit execution mode.
                                                            - If OpenCL, omit lgkmcnt(0).
                                                            - s_waitcnt vmcnt(0) must happen
                                                              after any preceding
                                                              global/generic load/store/load
                                                              atomic/store atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0) must
                                                              happen after any preceding
                                                              local/generic load/store/load
                                                              atomic/store atomic/atomicrmw.
                                                            - Must happen before the
                                                              following atomicrmw.
                                                            - Ensures that all memory
                                                              operations have completed
                                                              before performing the
                                                              atomicrmw that is being
                                                              released.

                                                         2. flat_atomic
                                                         3. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                            - If not TgSplit execution
                                                              mode, omit vmcnt(0).
                                                            - If OpenCL, omit lgkmcnt(0).
                                                            - Must happen before the
                                                              following buffer_inv and any
                                                              following global/generic
                                                              load/load atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures any following global
                                                              data read is no older than a
                                                              local load atomic value being
                                                              acquired.

                                                         4. buffer_inv sc0=1

                                                            - If not TgSplit execution
                                                              mode, omit.
                                                            - Ensures that following loads
                                                              will not see stale data.

     atomicrmw    acq_rel      - agent        - global   1. buffer_wbl2 sc1=1

                                                            - Must happen before following
                                                              s_waitcnt.
                                                            - Performs L2 writeback to
                                                              ensure previous global/generic
                                                              store/atomicrmw are visible at
                                                              agent scope.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                            - If TgSplit execution mode,
                                                              omit lgkmcnt(0).
                                                            - If OpenCL, omit lgkmcnt(0).
                                                            - Could be split into separate
                                                              s_waitcnt vmcnt(0) and
                                                              s_waitcnt lgkmcnt(0) to allow
                                                              them to be independently moved
                                                              according to the following
                                                              rules.
                                                            - s_waitcnt vmcnt(0) must happen
                                                              after any preceding
                                                              global/generic load/store/load
                                                              atomic/store atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0) must
                                                              happen after any preceding
                                                              local/generic load/store/load
                                                              atomic/store atomic/atomicrmw.
                                                            - Must happen before the
                                                              following atomicrmw.
                                                            - Ensures that all memory
                                                              operations to global have
                                                              completed before performing
                                                              the atomicrmw that is being
                                                              released.

                                                         3. buffer/global_atomic
                                                         4. s_waitcnt vmcnt(0)

                                                            - Must happen before following
                                                              buffer_inv.
                                                            - Ensures the atomicrmw has
                                                              completed before invalidating
                                                              the cache.

                                                         5. buffer_inv sc1=1

                                                            - Must happen before any
                                                              following global/generic
                                                              load/load atomic/atomicrmw.
                                                            - Ensures that following loads
                                                              will not see stale global
                                                              data.

     atomicrmw    acq_rel      - system       - global   1. buffer_wbl2 sc0=1 sc1=1

                                                            - Must happen before following
                                                              s_waitcnt.
                                                            - Performs L2 writeback to
                                                              ensure previous global/generic
                                                              store/atomicrmw are visible at
                                                              system scope.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                            - If TgSplit execution mode,
                                                              omit lgkmcnt(0).
                                                            - If OpenCL, omit lgkmcnt(0).
                                                            - Could be split into separate
                                                              s_waitcnt vmcnt(0) and
                                                              s_waitcnt lgkmcnt(0) to allow
                                                              them to be independently moved
                                                              according to the following
                                                              rules.
                                                            - s_waitcnt vmcnt(0) must happen
                                                              after any preceding
                                                              global/generic load/store/load
                                                              atomic/store atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0) must
                                                              happen after any preceding
                                                              local/generic load/store/load
                                                              atomic/store atomic/atomicrmw.
                                                            - Must happen before the
                                                              following atomicrmw.
                                                            - Ensures that all memory
                                                              operations to global and L2
                                                              writeback have completed
                                                              before performing the
                                                              atomicrmw that is being
                                                              released.

                                                         3. buffer/global_atomic sc1=1
                                                         4. s_waitcnt vmcnt(0)

                                                            - Must happen before following
                                                              buffer_inv.
                                                            - Ensures the atomicrmw has
                                                              completed before invalidating
                                                              the caches.

                                                         5. buffer_inv sc0=1 sc1=1

                                                            - Must happen before any
                                                              following global/generic
                                                              load/load atomic/atomicrmw.
                                                            - Ensures that following loads
                                                              will not see stale MTYPE NC
                                                              global data. MTYPE RW and CC
                                                              memory will never be stale due
                                                              to the memory probes.

     atomicrmw    acq_rel      - agent        - generic  1. buffer_wbl2 sc1=1

                                                            - Must happen before following
                                                              s_waitcnt.
                                                            - Performs L2 writeback to
                                                              ensure previous global/generic
                                                              store/atomicrmw are visible at
                                                              agent scope.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                            - If TgSplit execution mode,
                                                              omit lgkmcnt(0).
                                                            - If OpenCL, omit lgkmcnt(0).
                                                            - Could be split into separate
                                                              s_waitcnt vmcnt(0) and
                                                              s_waitcnt lgkmcnt(0) to allow
                                                              them to be independently moved
                                                              according to the following
                                                              rules.
                                                            - s_waitcnt vmcnt(0) must happen
                                                              after any preceding
                                                              global/generic load/store/load
                                                              atomic/store atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0) must
                                                              happen after any preceding
                                                              local/generic load/store/load
                                                              atomic/store atomic/atomicrmw.
                                                            - Must happen before the
                                                              following atomicrmw.
                                                            - Ensures that all memory
                                                              operations to global have
                                                              completed before performing
                                                              the atomicrmw that is being
                                                              released.

                                                         3. flat_atomic
                                                         4. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)

                                                            - If TgSplit execution mode,
                                                              omit lgkmcnt(0).
                                                            - If OpenCL, omit lgkmcnt(0).
                                                            - Must happen before following
                                                              buffer_inv.
                                                            - Ensures the atomicrmw has
                                                              completed before invalidating
                                                              the cache.

                                                         5. buffer_inv sc1=1

                                                            - Must happen before any
                                                              following global/generic
                                                              load/load atomic/atomicrmw.
                                                            - Ensures that following loads
                                                              will not see stale global
                                                              data.

     atomicrmw    acq_rel      - system       - generic  1. buffer_wbl2 sc0=1 sc1=1

                                                            - Must happen before following
                                                              s_waitcnt.
                                                            - Performs L2 writeback to
                                                              ensure previous global/generic
                                                              store/atomicrmw are visible at
                                                              system scope.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                            - If TgSplit execution mode,
                                                              omit lgkmcnt(0).
                                                            - If OpenCL, omit lgkmcnt(0).
                                                            - Could be split into separate
                                                              s_waitcnt vmcnt(0) and
                                                              s_waitcnt lgkmcnt(0) to allow
                                                              them to be independently moved
                                                              according to the following
                                                              rules.
                                                            - s_waitcnt vmcnt(0) must happen
                                                              after any preceding
                                                              global/generic load/store/load
                                                              atomic/store atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0) must
                                                              happen after any preceding
                                                              local/generic load/store/load
                                                              atomic/store atomic/atomicrmw.
                                                            - Must happen before the
                                                              following atomicrmw.
                                                            - Ensures that all memory
                                                              operations to global and L2
                                                              writeback have completed
                                                              before performing the
                                                              atomicrmw that is being
                                                              released.

                                                         3. flat_atomic sc1=1
                                                         4. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)

                                                            - If TgSplit execution mode,
                                                              omit lgkmcnt(0).
                                                            - If OpenCL, omit lgkmcnt(0).
                                                            - Must happen before following
                                                              buffer_inv.
                                                            - Ensures the atomicrmw has
                                                              completed before invalidating
                                                              the caches.

                                                         5.
                                                            buffer_inv sc0=1 sc1=1

                                                            - Must happen before any
                                                              following global/generic
                                                              load/load atomic/atomicrmw.
                                                            - Ensures that following loads
                                                              will not see stale MTYPE NC
                                                              global data. MTYPE RW and CC
                                                              memory will never be stale due
                                                              to the memory probes.

     fence        acq_rel      - singlethread *none*     *none*
                               - wavefront
     fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)

                                                            - Use lgkmcnt(0) if not TgSplit
                                                              execution mode and vmcnt(0) if
                                                              TgSplit execution mode.
                                                            - If OpenCL and address space is
                                                              not generic, omit lgkmcnt(0).
                                                            - If OpenCL and address space is
                                                              local, omit vmcnt(0).
                                                            - However, since LLVM currently
                                                              has no address space on the
                                                              fence need to conservatively
                                                              always generate (see comment
                                                              for previous fence).
                                                            - s_waitcnt vmcnt(0) must happen
                                                              after any preceding
                                                              global/generic load/store/load
                                                              atomic/store atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0) must
                                                              happen after any preceding
                                                              local/generic load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Must happen before any
                                                              following global/generic
                                                              load/load atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures that all memory
                                                              operations have completed
                                                              before performing any
                                                              following global memory
                                                              operations.
                                                            - Ensures that the preceding
                                                              local/generic load
                                                              atomic/atomicrmw with an equal
                                                              or wider sync scope and memory
                                                              ordering stronger than
                                                              unordered (this is termed the
                                                              acquire-fence-paired-atomic)
                                                              has completed before following
                                                              global memory operations. This
                                                              satisfies the requirements of
                                                              acquire.
                                                            - Ensures that all previous
                                                              memory operations have
                                                              completed before a following
                                                              local/generic store
                                                              atomic/atomicrmw with an equal
                                                              or wider sync scope and memory
                                                              ordering stronger than
                                                              unordered (this is termed the
                                                              release-fence-paired-atomic).
                                                              This satisfies the
                                                              requirements of release.
                                                            - Must happen before the
                                                              following buffer_inv.
                                                            - Ensures that the
                                                              acquire-fence-paired atomic
                                                              has completed before
                                                              invalidating the cache.
                                                              Therefore any following
                                                              locations read must be no
                                                              older than the value read by
                                                              the
                                                              acquire-fence-paired-atomic.

                                                         2. buffer_inv sc0=1

                                                            - If not TgSplit execution
                                                              mode, omit.
                                                            - Ensures that following loads
                                                              will not see stale data.

     fence        acq_rel      - agent        *none*     1. buffer_wbl2 sc1=1

                                                            - If OpenCL and address space is
                                                              local, omit.
                                                            - Must happen before following
                                                              s_waitcnt.
                                                            - Performs L2 writeback to
                                                              ensure previous global/generic
                                                              store/atomicrmw are visible at
                                                              agent scope.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                            - If TgSplit execution mode,
                                                              omit lgkmcnt(0).
                                                            - If OpenCL and address space is
                                                              not generic, omit lgkmcnt(0).
                                                            - However, since LLVM currently
                                                              has no address space on the
                                                              fence need to conservatively
                                                              always generate (see comment
                                                              for previous fence).
                                                            - Could be split into separate
                                                              s_waitcnt vmcnt(0) and
                                                              s_waitcnt lgkmcnt(0) to allow
                                                              them to be independently moved
                                                              according to the following
                                                              rules.
                                                            - s_waitcnt vmcnt(0) must happen
                                                              after any preceding
                                                              global/generic load/store/load
                                                              atomic/store atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0) must
                                                              happen after any preceding
                                                              local/generic load/store/load
                                                              atomic/store atomic/atomicrmw.
                                                            - Must happen before the
                                                              following buffer_inv.
                                                            - Ensures that the preceding
                                                              global/local/generic load
                                                              atomic/atomicrmw with an equal
                                                              or wider sync scope and memory
                                                              ordering stronger than
                                                              unordered (this is termed the
                                                              acquire-fence-paired-atomic)
                                                              has completed before
                                                              invalidating the cache. This
                                                              satisfies the requirements of
                                                              acquire.
                                                            - Ensures that all previous
                                                              memory operations have
                                                              completed before a following
                                                              global/local/generic store
                                                              atomic/atomicrmw with an equal
                                                              or wider sync scope and memory
                                                              ordering stronger than
                                                              unordered (this is termed the
                                                              release-fence-paired-atomic).
                                                              This satisfies the
                                                              requirements of release.

                                                         3. buffer_inv sc1=1

                                                            - Must happen before any
                                                              following global/generic
                                                              load/load atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures that following loads
                                                              will not see stale global
                                                              data. This satisfies the
                                                              requirements of acquire.

     fence        acq_rel      - system       *none*     1. buffer_wbl2 sc0=1 sc1=1

                                                            - If OpenCL and address space is
                                                              local, omit.
                                                            - Must happen before following
                                                              s_waitcnt.
                                                            - Performs L2 writeback to
                                                              ensure previous global/generic
                                                              store/atomicrmw are visible at
                                                              system scope.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                            - If TgSplit execution mode,
                                                              omit lgkmcnt(0).
                                                            - If OpenCL and address space is
                                                              not generic, omit lgkmcnt(0).
                                                            - However, since LLVM currently
                                                              has no address space on the
                                                              fence need to conservatively
                                                              always generate (see comment
                                                              for previous fence).
                                                            - Could be split into separate
                                                              s_waitcnt vmcnt(0) and
                                                              s_waitcnt lgkmcnt(0) to allow
                                                              them to be independently moved
                                                              according to the following
                                                              rules.
                                                            - s_waitcnt vmcnt(0) must happen
                                                              after any preceding
                                                              global/generic load/store/load
                                                              atomic/store atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0) must
                                                              happen after any preceding
                                                              local/generic load/store/load
                                                              atomic/store atomic/atomicrmw.
                                                            - Must happen before the
                                                              following buffer_inv.
                                                            - Ensures that the preceding
                                                              global/local/generic load
                                                              atomic/atomicrmw with an equal
                                                              or wider sync scope and memory
                                                              ordering stronger than
                                                              unordered (this is termed the
                                                              acquire-fence-paired-atomic)
                                                              has completed before
                                                              invalidating the cache. This
                                                              satisfies the requirements of
                                                              acquire.
                                                            - Ensures that all previous
                                                              memory operations have
                                                              completed before a following
                                                              global/local/generic store
                                                              atomic/atomicrmw with an equal
                                                              or wider sync scope and memory
                                                              ordering stronger than
                                                              unordered (this is termed the
                                                              release-fence-paired-atomic).
                                                              This satisfies the
                                                              requirements of release.

                                                         3. buffer_inv sc0=1 sc1=1

                                                            - Must happen before any
                                                              following global/generic
                                                              load/load atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures that following loads
                                                              will not see stale MTYPE NC
                                                              global data. MTYPE RW and CC
                                                              memory will never be stale due
                                                              to the memory probes.
     **Sequential Consistent Atomic**
     ------------------------------------------------------------------------------------
     load atomic  seq_cst      - singlethread - global   *Same as corresponding
                               - wavefront    - local    load atomic acquire,
                                              - generic  except must generate
                                                         all instructions even
                                                         for OpenCL.*
     load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
                                              - generic
                                                            - Use lgkmcnt(0) if not TgSplit
                                                              execution mode and vmcnt(0) if
                                                              TgSplit execution mode.
                                                            - s_waitcnt lgkmcnt(0) must
                                                              happen after preceding
                                                              local/generic load
                                                              atomic/store atomic/atomicrmw
                                                              with memory ordering of
                                                              seq_cst and with equal or
                                                              wider sync scope. (Note that
                                                              seq_cst fences have their own
                                                              s_waitcnt lgkmcnt(0) and so do
                                                              not need to be considered.)
                                                            - s_waitcnt vmcnt(0) must happen
                                                              after preceding global/generic
                                                              load atomic/store
                                                              atomic/atomicrmw with memory
                                                              ordering of seq_cst and with
                                                              equal or wider sync scope.
                                                              (Note that seq_cst fences have
                                                              their own s_waitcnt vmcnt(0)
                                                              and so do not need to be
                                                              considered.)
                                                            - Ensures any preceding
                                                              sequential consistent
                                                              global/local memory
                                                              instructions have completed
                                                              before executing this
                                                              sequentially consistent
                                                              instruction. This prevents
                                                              reordering a seq_cst store
                                                              followed by a seq_cst load.
                                                              (Note that seq_cst is stronger
                                                              than acquire/release as the
                                                              reordering of load acquire
                                                              followed by a store release is
                                                              prevented by the s_waitcnt of
                                                              the release, but there is
                                                              nothing preventing a store
                                                              release followed by load
                                                              acquire from completing out of
                                                              order. The s_waitcnt could be
                                                              placed after seq_store or
                                                              before the seq_load. We choose
                                                              the load to make the s_waitcnt
                                                              be as late as possible so that
                                                              the store may have already
                                                              completed.)

                                                         2. *Following
                                                            instructions same as
                                                            corresponding load
                                                            atomic acquire,
                                                            except must generate
                                                            all instructions even
                                                            for OpenCL.*
     load atomic  seq_cst      - workgroup    - local    *If TgSplit execution mode,
                                                         local address space cannot
                                                         be used.*

                                                         *Same as corresponding
                                                         load atomic acquire,
                                                         except must generate
                                                         all instructions even
                                                         for OpenCL.*

     load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
                               - system       - generic     vmcnt(0)

                                                            - If TgSplit execution mode,
                                                              omit lgkmcnt(0).
                                                            - Could be split into separate
                                                              s_waitcnt vmcnt(0) and
                                                              s_waitcnt lgkmcnt(0) to allow
                                                              them to be independently moved
                                                              according to the following
                                                              rules.
                                                            - s_waitcnt lgkmcnt(0) must
                                                              happen after preceding
                                                              global/generic load
                                                              atomic/store atomic/atomicrmw
                                                              with memory ordering of
                                                              seq_cst and with equal or
                                                              wider sync scope. (Note that
                                                              seq_cst fences have their own
                                                              s_waitcnt lgkmcnt(0) and so do
                                                              not need to be considered.)
                                                            - s_waitcnt vmcnt(0) must happen
                                                              after preceding global/generic
                                                              load atomic/store
                                                              atomic/atomicrmw with memory
                                                              ordering of seq_cst and with
                                                              equal or wider sync scope.
                                                              (Note that seq_cst fences have
                                                              their own s_waitcnt vmcnt(0)
                                                              and so do not need to be
                                                              considered.)
                                                            - Ensures any preceding
                                                              sequential consistent global
                                                              memory instructions have
                                                              completed before executing
                                                              this sequentially consistent
                                                              instruction. This prevents
                                                              reordering a seq_cst store
                                                              followed by a seq_cst load.
                                                              (Note that seq_cst is stronger
                                                              than acquire/release as the
                                                              reordering of load acquire
                                                              followed by a store release is
                                                              prevented by the s_waitcnt of
                                                              the release, but there is
                                                              nothing preventing a store
                                                              release followed by load
                                                              acquire from completing out of
                                                              order. The s_waitcnt could be
                                                              placed after seq_store or
                                                              before the seq_load. We choose
                                                              the load to make the s_waitcnt
                                                              be as late as possible so that
                                                              the store may have already
                                                              completed.)

                                                         2. *Following
                                                            instructions same as
                                                            corresponding load
                                                            atomic acquire,
                                                            except must generate
                                                            all instructions even
                                                            for OpenCL.*
     store atomic seq_cst      - singlethread - global   *Same as corresponding
                               - wavefront    - local    store atomic release,
                               - workgroup    - generic  except must generate
                               - agent                   all instructions even
                               - system                  for OpenCL.*
     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
                               - wavefront    - local    atomicrmw acq_rel,
                               - workgroup    - generic  except must generate
                               - agent                   all instructions even
                               - system                  for OpenCL.*
     fence        seq_cst      - singlethread *none*     *Same as corresponding
                               - wavefront               fence acq_rel,
                               - workgroup               except must generate
                               - agent                   all instructions even
                               - system                  for OpenCL.*
     ============ ============ ============== ========== ================================

.. _amdgpu-amdhsa-memory-model-gfx10-gfx11:

Memory Model GFX10-GFX11
++++++++++++++++++++++++

For GFX10-GFX11:

* Each agent has multiple shader arrays (SA).
* Each SA has multiple work-group processors (WGP).
* Each WGP has multiple compute units (CU).
* Each CU has multiple SIMDs that execute wavefronts.
* The wavefronts for a single work-group are executed in the same WGP. In CU
  wavefront execution mode the wavefronts may be executed by different SIMDs in
  the same CU. In WGP wavefront execution mode the wavefronts may be executed by
  different SIMDs in different CUs in the same WGP.
* Each WGP has a single LDS memory shared by the wavefronts of the work-groups
  executing on it.
* All LDS operations of a WGP are performed as wavefront wide operations in a
  global order and involve no caching. Completion is reported to a wavefront in
  execution order.
* The LDS memory has multiple request queues shared by the SIMDs of a
  WGP. Therefore, the LDS operations performed by different wavefronts of a
  work-group can be reordered relative to each other, which can result in
  reordering the visibility of vector memory operations with respect to LDS
  operations of other wavefronts in the same work-group. A ``s_waitcnt
  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
  vector memory operations between wavefronts of a work-group, but not between
  operations performed by the same wavefront.
* The vector memory operations are performed as wavefront wide operations.
  Completion of load/store/sample operations is reported to a wavefront in
  execution order of other load/store/sample operations performed by that
  wavefront.
* The vector memory operations access a vector L0 cache. There is a single L0
  cache per CU. Each SIMD of a CU accesses the same L0 cache. Therefore, no
  special action is required for coherence between the lanes of a single
  wavefront. However, a ``buffer_gl0_inv`` is required for coherence between
  wavefronts executing in the same work-group as they may be executing on SIMDs
  of different CUs that access different L0s. A ``buffer_gl0_inv`` is also
  required for coherence between wavefronts executing in different work-groups
  as they may be executing on different WGPs.
* The scalar memory operations access a scalar L0 cache shared by all wavefronts
  on a WGP. The scalar and vector L0 caches are not coherent. However, scalar
  operations are used in a restricted way so do not impact the memory model. See
  :ref:`amdgpu-amdhsa-memory-spaces`.
* The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on
  the same SA. Therefore, no special action is required for coherence between
  the wavefronts of a single work-group. However, a ``buffer_gl1_inv`` is
  required for coherence between wavefronts executing in different work-groups
  as they may be executing on different SAs that access different L1s.
* The L1 caches have independent quadrants to service disjoint ranges of virtual
  addresses.
* Each L0 cache has a separate request queue per L1 quadrant. Therefore, the
  vector and scalar memory operations performed by different wavefronts, whether
  executing in the same or different work-groups (which may be executing on
  different CUs accessing different L0s), can be reordered relative to each
  other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is required to ensure
  synchronization between vector memory operations of different wavefronts. It
  ensures a previous vector memory operation has completed before executing a
  subsequent vector memory or LDS operation and so can be used to meet the
  requirements of acquire, release and sequential consistency.
* The L1 caches use an L2 cache shared by all SAs on the same agent.
* The L2 cache has independent channels to service disjoint ranges of virtual
  addresses.
* Each L1 quadrant of a single SA accesses a different L2 channel. Each L1
  quadrant has a separate request queue per L2 channel. Therefore, the vector
  and scalar memory operations performed by wavefronts executing in different
  work-groups (which may be executing on different SAs) of an agent can be
  reordered relative to each other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is
  required to ensure synchronization between vector memory operations of
  different SAs. It ensures a previous vector memory operation has completed
  before executing a subsequent vector memory operation and so can be used to
  meet the requirements of acquire, release and sequential consistency.
* The L2 cache can be kept coherent with other agents on some targets, or ranges
  of virtual addresses can be set up to bypass it to ensure system coherence.
* On GFX10.3 and GFX11 a memory attached last level (MALL) cache exists for GPU
  memory. The MALL cache is fully coherent with GPU memory and has no impact on
  system coherence. All agents (GPU and CPU) access GPU memory through the MALL
  cache.

Scalar memory operations are only used to access memory that is proven to not
change during the execution of the kernel dispatch. This includes constant
address space and global address space for program scope ``const`` variables.
Therefore, the kernel machine code does not have to maintain the scalar cache to
ensure it is coherent with the vector caches. The scalar and vector caches are
invalidated between kernel dispatches by CP since constant address space data
may change between kernel dispatch executions. See
:ref:`amdgpu-amdhsa-memory-spaces`.

The one exception is if scalar writes are used to spill SGPR registers. In this
case the AMDGPU backend ensures the memory location used to spill is never
accessed by vector memory operations at the same time. If scalar writes are used
then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
return since the locations may be used for vector memory instructions by a
future wavefront that uses the same scratch area, or a function call that
creates a frame at the same address, respectively. There is no need for a
``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.

For kernarg backing memory:

* CP invalidates the L0 and L1 caches at the start of each kernel dispatch.
* On dGPU the kernarg backing memory is accessed as MTYPE UC (uncached) to avoid
  needing to invalidate the L2 cache.
* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
  so the L2 cache will be coherent with the CPU and other agents.

Scratch backing memory (which is used for the private address space) is accessed
with MTYPE NC (non-coherent). Since the private address space is only accessed
by a single thread, and is always write-before-read, there is never a need to
invalidate these entries from the L0 or L1 caches.

Wavefronts are executed in native mode with in-order reporting of loads and
sample instructions. In this mode vmcnt reports completion of load, atomic with
return and sample instructions in order, and the vscnt reports the completion of
store and atomic without return in order. See ``MEM_ORDERED`` field in
:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.

Wavefronts can be executed in WGP or CU wavefront execution mode:

* In WGP wavefront execution mode the wavefronts of a work-group are executed
  on the SIMDs of both CUs of the WGP. Therefore, explicit management of the per
  CU L0 caches is required for work-group synchronization.
Also accesses to L1 11296 at work-group scope need to be explicitly ordered as the accesses from 11297 different CUs are not ordered. 11298* In CU wavefront execution mode the wavefronts of a work-group are executed on 11299 the SIMDs of a single CU of the WGP. Therefore, all global memory accesses by 11300 the work-group access the same L0, which in turn ensures L1 accesses are 11301 ordered and so do not require explicit management of the caches for 11302 work-group synchronization. 11303 11304See ``WGP_MODE`` field in 11305:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table` and 11306:ref:`amdgpu-target-features`. 11307 11308The code sequences used to implement the memory model for GFX10-GFX11 are defined in 11309table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`. 11310 11311 .. table:: AMDHSA Memory Model Code Sequences GFX10-GFX11 11312 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table 11313 11314 ============ ============ ============== ========== ================================ 11315 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code 11316 Ordering Sync Scope Address GFX10-GFX11 11317 Space 11318 ============ ============ ============== ========== ================================ 11319 **Non-Atomic** 11320 ------------------------------------------------------------------------------------ 11321 load *none* *none* - global - !volatile & !nontemporal 11322 - generic 11323 - private 1. buffer/global/flat_load 11324 - constant 11325 - !volatile & nontemporal 11326 11327 1. buffer/global/flat_load 11328 slc=1 dlc=1 11329 11330 - If GFX10, omit dlc=1. 11331 11332 - volatile 11333 11334 1. buffer/global/flat_load 11335 glc=1 dlc=1 11336 11337 2. s_waitcnt vmcnt(0) 11338 11339 - Must happen before 11340 any following volatile 11341 global/generic 11342 load/store. 11343 - Ensures that 11344 volatile 11345 operations to 11346 different 11347 addresses will not 11348 be reordered by 11349 hardware.
11350 11351 load *none* *none* - local 1. ds_load 11352 store *none* *none* - global - !volatile & !nontemporal 11353 - generic 11354 - private 1. buffer/global/flat_store 11355 - constant 11356 - !volatile & nontemporal 11357 11358 1. buffer/global/flat_store 11359 glc=1 slc=1 dlc=1 11360 11361 - If GFX10, omit dlc=1. 11362 11363 - volatile 11364 11365 1. buffer/global/flat_store 11366 dlc=1 11367 11368 - If GFX10, omit dlc=1. 11369 11370 2. s_waitcnt vscnt(0) 11371 11372 - Must happen before 11373 any following volatile 11374 global/generic 11375 load/store. 11376 - Ensures that 11377 volatile 11378 operations to 11379 different 11380 addresses will not 11381 be reordered by 11382 hardware. 11383 11384 store *none* *none* - local 1. ds_store 11385 **Unordered Atomic** 11386 ------------------------------------------------------------------------------------ 11387 load atomic unordered *any* *any* *Same as non-atomic*. 11388 store atomic unordered *any* *any* *Same as non-atomic*. 11389 atomicrmw unordered *any* *any* *Same as monotonic atomic*. 11390 **Monotonic Atomic** 11391 ------------------------------------------------------------------------------------ 11392 load atomic monotonic - singlethread - global 1. buffer/global/flat_load 11393 - wavefront - generic 11394 load atomic monotonic - workgroup - global 1. buffer/global/flat_load 11395 - generic glc=1 11396 11397 - If CU wavefront execution 11398 mode, omit glc=1. 11399 11400 load atomic monotonic - singlethread - local 1. ds_load 11401 - wavefront 11402 - workgroup 11403 load atomic monotonic - agent - global 1. buffer/global/flat_load 11404 - system - generic glc=1 dlc=1 11405 11406 - If GFX11, omit dlc=1. 11407 11408 store atomic monotonic - singlethread - global 1. buffer/global/flat_store 11409 - wavefront - generic 11410 - workgroup 11411 - agent 11412 - system 11413 store atomic monotonic - singlethread - local 1. 
ds_store 11414 - wavefront 11415 - workgroup 11416 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic 11417 - wavefront - generic 11418 - workgroup 11419 - agent 11420 - system 11421 atomicrmw monotonic - singlethread - local 1. ds_atomic 11422 - wavefront 11423 - workgroup 11424 **Acquire Atomic** 11425 ------------------------------------------------------------------------------------ 11426 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load 11427 - wavefront - local 11428 - generic 11429 load atomic acquire - workgroup - global 1. buffer/global_load glc=1 11430 11431 - If CU wavefront execution 11432 mode, omit glc=1. 11433 11434 2. s_waitcnt vmcnt(0) 11435 11436 - If CU wavefront execution 11437 mode, omit. 11438 - Must happen before 11439 the following buffer_gl0_inv 11440 and before any following 11441 global/generic 11442 load/load 11443 atomic/store/store 11444 atomic/atomicrmw. 11445 11446 3. buffer_gl0_inv 11447 11448 - If CU wavefront execution 11449 mode, omit. 11450 - Ensures that 11451 following 11452 loads will not see 11453 stale data. 11454 11455 load atomic acquire - workgroup - local 1. ds_load 11456 2. s_waitcnt lgkmcnt(0) 11457 11458 - If OpenCL, omit. 11459 - Must happen before 11460 the following buffer_gl0_inv 11461 and before any following 11462 global/generic load/load 11463 atomic/store/store 11464 atomic/atomicrmw. 11465 - Ensures any 11466 following global 11467 data read is no 11468 older than the local load 11469 atomic value being 11470 acquired. 11471 11472 3. buffer_gl0_inv 11473 11474 - If CU wavefront execution 11475 mode, omit. 11476 - If OpenCL, omit. 11477 - Ensures that 11478 following 11479 loads will not see 11480 stale data. 11481 11482 load atomic acquire - workgroup - generic 1. flat_load glc=1 11483 11484 - If CU wavefront execution 11485 mode, omit glc=1. 11486 11487 2. s_waitcnt lgkmcnt(0) & 11488 vmcnt(0) 11489 11490 - If CU wavefront execution 11491 mode, omit vmcnt(0). 
11492 - If OpenCL, omit 11493 lgkmcnt(0). 11494 - Must happen before 11495 the following 11496 buffer_gl0_inv and any 11497 following global/generic 11498 load/load 11499 atomic/store/store 11500 atomic/atomicrmw. 11501 - Ensures any 11502 following global 11503 data read is no 11504 older than a local load 11505 atomic value being 11506 acquired. 11507 11508 3. buffer_gl0_inv 11509 11510 - If CU wavefront execution 11511 mode, omit. 11512 - Ensures that 11513 following 11514 loads will not see 11515 stale data. 11516 11517 load atomic acquire - agent - global 1. buffer/global_load 11518 - system glc=1 dlc=1 11519 11520 - If GFX11, omit dlc=1. 11521 11522 2. s_waitcnt vmcnt(0) 11523 11524 - Must happen before 11525 following 11526 buffer_gl*_inv. 11527 - Ensures the load 11528 has completed 11529 before invalidating 11530 the caches. 11531 11532 3. buffer_gl0_inv; 11533 buffer_gl1_inv 11534 11535 - Must happen before 11536 any following 11537 global/generic 11538 load/load 11539 atomic/atomicrmw. 11540 - Ensures that 11541 following 11542 loads will not see 11543 stale global data. 11544 11545 load atomic acquire - agent - generic 1. flat_load glc=1 dlc=1 11546 - system 11547 - If GFX11, omit dlc=1. 11548 11549 2. s_waitcnt vmcnt(0) & 11550 lgkmcnt(0) 11551 11552 - If OpenCL, omit 11553 lgkmcnt(0). 11554 - Must happen before 11555 following 11556 buffer_gl*_inv. 11557 - Ensures the flat_load 11558 has completed 11559 before invalidating 11560 the caches. 11561 11562 3. buffer_gl0_inv; 11563 buffer_gl1_inv 11564 11565 - Must happen before 11566 any following 11567 global/generic 11568 load/load 11569 atomic/atomicrmw. 11570 - Ensures that 11571 following loads 11572 will not see stale 11573 global data. 11574 11575 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic 11576 - wavefront - local 11577 - generic 11578 atomicrmw acquire - workgroup - global 1. buffer/global_atomic 11579 2.
s_waitcnt vm/vscnt(0) 11580 11581 - If CU wavefront execution 11582 mode, omit. 11583 - Use vmcnt(0) if atomic with 11584 return and vscnt(0) if 11585 atomic with no-return. 11586 - Must happen before 11587 the following buffer_gl0_inv 11588 and before any following 11589 global/generic 11590 load/load 11591 atomic/store/store 11592 atomic/atomicrmw. 11593 11594 3. buffer_gl0_inv 11595 11596 - If CU wavefront execution 11597 mode, omit. 11598 - Ensures that 11599 following 11600 loads will not see 11601 stale data. 11602 11603 atomicrmw acquire - workgroup - local 1. ds_atomic 11604 2. s_waitcnt lgkmcnt(0) 11605 11606 - If OpenCL, omit. 11607 - Must happen before 11608 the following 11609 buffer_gl0_inv. 11610 - Ensures any 11611 following global 11612 data read is no 11613 older than the local 11614 atomicrmw value 11615 being acquired. 11616 11617 3. buffer_gl0_inv 11618 11619 - If OpenCL omit. 11620 - Ensures that 11621 following 11622 loads will not see 11623 stale data. 11624 11625 atomicrmw acquire - workgroup - generic 1. flat_atomic 11626 2. s_waitcnt lgkmcnt(0) & 11627 vm/vscnt(0) 11628 11629 - If CU wavefront execution 11630 mode, omit vm/vscnt(0). 11631 - If OpenCL, omit lgkmcnt(0). 11632 - Use vmcnt(0) if atomic with 11633 return and vscnt(0) if 11634 atomic with no-return. 11635 - Must happen before 11636 the following 11637 buffer_gl0_inv. 11638 - Ensures any 11639 following global 11640 data read is no 11641 older than a local 11642 atomicrmw value 11643 being acquired. 11644 11645 3. buffer_gl0_inv 11646 11647 - If CU wavefront execution 11648 mode, omit. 11649 - Ensures that 11650 following 11651 loads will not see 11652 stale data. 11653 11654 atomicrmw acquire - agent - global 1. buffer/global_atomic 11655 - system 2. s_waitcnt vm/vscnt(0) 11656 11657 - Use vmcnt(0) if atomic with 11658 return and vscnt(0) if 11659 atomic with no-return. 11660 - Must happen before 11661 following 11662 buffer_gl*_inv. 
11663 - Ensures the 11664 atomicrmw has 11665 completed before 11666 invalidating the 11667 caches. 11668 11669 3. buffer_gl0_inv; 11670 buffer_gl1_inv 11671 11672 - Must happen before 11673 any following 11674 global/generic 11675 load/load 11676 atomic/atomicrmw. 11677 - Ensures that 11678 following loads 11679 will not see stale 11680 global data. 11681 11682 atomicrmw acquire - agent - generic 1. flat_atomic 11683 - system 2. s_waitcnt vm/vscnt(0) & 11684 lgkmcnt(0) 11685 11686 - If OpenCL, omit 11687 lgkmcnt(0). 11688 - Use vmcnt(0) if atomic with 11689 return and vscnt(0) if 11690 atomic with no-return. 11691 - Must happen before 11692 following 11693 buffer_gl*_inv. 11694 - Ensures the 11695 atomicrmw has 11696 completed before 11697 invalidating the 11698 caches. 11699 11700 3. buffer_gl0_inv; 11701 buffer_gl1_inv 11702 11703 - Must happen before 11704 any following 11705 global/generic 11706 load/load 11707 atomic/atomicrmw. 11708 - Ensures that 11709 following loads 11710 will not see stale 11711 global data. 11712 11713 fence acquire - singlethread *none* *none* 11714 - wavefront 11715 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0) & 11716 vmcnt(0) & vscnt(0) 11717 11718 - If CU wavefront execution 11719 mode, omit vmcnt(0) and 11720 vscnt(0). 11721 - If OpenCL and 11722 address space is 11723 not generic, omit 11724 lgkmcnt(0). 11725 - If OpenCL and 11726 address space is 11727 local, omit 11728 vmcnt(0) and vscnt(0). 11729 - However, since LLVM 11730 currently has no 11731 address space on 11732 the fence need to 11733 conservatively 11734 always generate. If 11735 fence had an 11736 address space then 11737 set to address 11738 space of OpenCL 11739 fence flag, or to 11740 generic if both 11741 local and global 11742 flags are 11743 specified. 
11744 - Could be split into 11745 separate s_waitcnt 11746 vmcnt(0), s_waitcnt 11747 vscnt(0) and s_waitcnt 11748 lgkmcnt(0) to allow 11749 them to be 11750 independently moved 11751 according to the 11752 following rules. 11753 - s_waitcnt vmcnt(0) 11754 must happen after 11755 any preceding 11756 global/generic load 11757 atomic/ 11758 atomicrmw-with-return-value 11759 with an equal or 11760 wider sync scope 11761 and memory ordering 11762 stronger than 11763 unordered (this is 11764 termed the 11765 fence-paired-atomic). 11766 - s_waitcnt vscnt(0) 11767 must happen after 11768 any preceding 11769 global/generic 11770 atomicrmw-no-return-value 11771 with an equal or 11772 wider sync scope 11773 and memory ordering 11774 stronger than 11775 unordered (this is 11776 termed the 11777 fence-paired-atomic). 11778 - s_waitcnt lgkmcnt(0) 11779 must happen after 11780 any preceding 11781 local/generic load 11782 atomic/atomicrmw 11783 with an equal or 11784 wider sync scope 11785 and memory ordering 11786 stronger than 11787 unordered (this is 11788 termed the 11789 fence-paired-atomic). 11790 - Must happen before 11791 the following 11792 buffer_gl0_inv. 11793 - Ensures that the 11794 fence-paired atomic 11795 has completed 11796 before invalidating 11797 the 11798 cache. Therefore 11799 any following 11800 locations read must 11801 be no older than 11802 the value read by 11803 the 11804 fence-paired-atomic. 11805 11806 2. buffer_gl0_inv 11807 11808 - If CU wavefront execution 11809 mode, omit. 11810 - Ensures that 11811 following 11812 loads will not see 11813 stale data. 11814 11815 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) & 11816 - system vmcnt(0) & vscnt(0) 11817 11818 - If OpenCL and 11819 address space is 11820 not generic, omit 11821 lgkmcnt(0). 11822 - If OpenCL and 11823 address space is 11824 local, omit 11825 vmcnt(0) and vscnt(0).
11826 - However, since LLVM 11827 currently has no 11828 address space on 11829 the fence need to 11830 conservatively 11831 always generate 11832 (see comment for 11833 previous fence). 11834 - Could be split into 11835 separate s_waitcnt 11836 vmcnt(0), s_waitcnt 11837 vscnt(0) and s_waitcnt 11838 lgkmcnt(0) to allow 11839 them to be 11840 independently moved 11841 according to the 11842 following rules. 11843 - s_waitcnt vmcnt(0) 11844 must happen after 11845 any preceding 11846 global/generic load 11847 atomic/ 11848 atomicrmw-with-return-value 11849 with an equal or 11850 wider sync scope 11851 and memory ordering 11852 stronger than 11853 unordered (this is 11854 termed the 11855 fence-paired-atomic). 11856 - s_waitcnt vscnt(0) 11857 must happen after 11858 any preceding 11859 global/generic 11860 atomicrmw-no-return-value 11861 with an equal or 11862 wider sync scope 11863 and memory ordering 11864 stronger than 11865 unordered (this is 11866 termed the 11867 fence-paired-atomic). 11868 - s_waitcnt lgkmcnt(0) 11869 must happen after 11870 any preceding 11871 local/generic load 11872 atomic/atomicrmw 11873 with an equal or 11874 wider sync scope 11875 and memory ordering 11876 stronger than 11877 unordered (this is 11878 termed the 11879 fence-paired-atomic). 11880 - Must happen before 11881 the following 11882 buffer_gl*_inv. 11883 - Ensures that the 11884 fence-paired atomic 11885 has completed 11886 before invalidating 11887 the 11888 caches. Therefore 11889 any following 11890 locations read must 11891 be no older than 11892 the value read by 11893 the 11894 fence-paired-atomic. 11895 11896 2. buffer_gl0_inv; 11897 buffer_gl1_inv 11898 11899 - Must happen before any 11900 following global/generic 11901 load/load 11902 atomic/store/store 11903 atomic/atomicrmw. 11904 - Ensures that 11905 following loads 11906 will not see stale 11907 global data. 
11908 11909 **Release Atomic** 11910 ------------------------------------------------------------------------------------ 11911 store atomic release - singlethread - global 1. buffer/global/ds/flat_store 11912 - wavefront - local 11913 - generic 11914 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0) & 11915 - generic vmcnt(0) & vscnt(0) 11916 11917 - If CU wavefront execution 11918 mode, omit vmcnt(0) and 11919 vscnt(0). 11920 - If OpenCL, omit 11921 lgkmcnt(0). 11922 - Could be split into 11923 separate s_waitcnt 11924 vmcnt(0), s_waitcnt 11925 vscnt(0) and s_waitcnt 11926 lgkmcnt(0) to allow 11927 them to be 11928 independently moved 11929 according to the 11930 following rules. 11931 - s_waitcnt vmcnt(0) 11932 must happen after 11933 any preceding 11934 global/generic load/load 11935 atomic/ 11936 atomicrmw-with-return-value. 11937 - s_waitcnt vscnt(0) 11938 must happen after 11939 any preceding 11940 global/generic 11941 store/store 11942 atomic/ 11943 atomicrmw-no-return-value. 11944 - s_waitcnt lgkmcnt(0) 11945 must happen after 11946 any preceding 11947 local/generic 11948 load/store/load 11949 atomic/store 11950 atomic/atomicrmw. 11951 - Must happen before 11952 the following 11953 store. 11954 - Ensures that all 11955 memory operations 11956 have 11957 completed before 11958 performing the 11959 store that is being 11960 released. 11961 11962 2. buffer/global/flat_store 11963 store atomic release - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0) 11964 11965 - If CU wavefront execution 11966 mode, omit. 11967 - If OpenCL, omit. 11968 - Could be split into 11969 separate s_waitcnt 11970 vmcnt(0) and s_waitcnt 11971 vscnt(0) to allow 11972 them to be 11973 independently moved 11974 according to the 11975 following rules. 11976 - s_waitcnt vmcnt(0) 11977 must happen after 11978 any preceding 11979 global/generic load/load 11980 atomic/ 11981 atomicrmw-with-return-value. 
11982 - s_waitcnt vscnt(0) 11983 must happen after 11984 any preceding 11985 global/generic 11986 store/store atomic/ 11987 atomicrmw-no-return-value. 11988 - Must happen before 11989 the following 11990 store. 11991 - Ensures that all 11992 global memory 11993 operations have 11994 completed before 11995 performing the 11996 store that is being 11997 released. 11998 11999 2. ds_store 12000 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) & 12001 - system - generic vmcnt(0) & vscnt(0) 12002 12003 - If OpenCL and 12004 address space is 12005 not generic, omit 12006 lgkmcnt(0). 12007 - Could be split into 12008 separate s_waitcnt 12009 vmcnt(0), s_waitcnt vscnt(0) 12010 and s_waitcnt 12011 lgkmcnt(0) to allow 12012 them to be 12013 independently moved 12014 according to the 12015 following rules. 12016 - s_waitcnt vmcnt(0) 12017 must happen after 12018 any preceding 12019 global/generic 12020 load/load 12021 atomic/ 12022 atomicrmw-with-return-value. 12023 - s_waitcnt vscnt(0) 12024 must happen after 12025 any preceding 12026 global/generic 12027 store/store atomic/ 12028 atomicrmw-no-return-value. 12029 - s_waitcnt lgkmcnt(0) 12030 must happen after 12031 any preceding 12032 local/generic 12033 load/store/load 12034 atomic/store 12035 atomic/atomicrmw. 12036 - Must happen before 12037 the following 12038 store. 12039 - Ensures that all 12040 memory operations 12041 have 12042 completed before 12043 performing the 12044 store that is being 12045 released. 12046 12047 2. buffer/global/flat_store 12048 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic 12049 - wavefront - local 12050 - generic 12051 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0) & 12052 - generic vmcnt(0) & vscnt(0) 12053 12054 - If CU wavefront execution 12055 mode, omit vmcnt(0) and 12056 vscnt(0). 12057 - If OpenCL, omit lgkmcnt(0). 
12058 - Could be split into 12059 separate s_waitcnt 12060 vmcnt(0), s_waitcnt 12061 vscnt(0) and s_waitcnt 12062 lgkmcnt(0) to allow 12063 them to be 12064 independently moved 12065 according to the 12066 following rules. 12067 - s_waitcnt vmcnt(0) 12068 must happen after 12069 any preceding 12070 global/generic load/load 12071 atomic/ 12072 atomicrmw-with-return-value. 12073 - s_waitcnt vscnt(0) 12074 must happen after 12075 any preceding 12076 global/generic 12077 store/store 12078 atomic/ 12079 atomicrmw-no-return-value. 12080 - s_waitcnt lgkmcnt(0) 12081 must happen after 12082 any preceding 12083 local/generic 12084 load/store/load 12085 atomic/store 12086 atomic/atomicrmw. 12087 - Must happen before 12088 the following 12089 atomicrmw. 12090 - Ensures that all 12091 memory operations 12092 have 12093 completed before 12094 performing the 12095 atomicrmw that is 12096 being released. 12097 12098 2. buffer/global/flat_atomic 12099 atomicrmw release - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0) 12100 12101 - If CU wavefront execution 12102 mode, omit. 12103 - If OpenCL, omit. 12104 - Could be split into 12105 separate s_waitcnt 12106 vmcnt(0) and s_waitcnt 12107 vscnt(0) to allow 12108 them to be 12109 independently moved 12110 according to the 12111 following rules. 12112 - s_waitcnt vmcnt(0) 12113 must happen after 12114 any preceding 12115 global/generic load/load 12116 atomic/ 12117 atomicrmw-with-return-value. 12118 - s_waitcnt vscnt(0) 12119 must happen after 12120 any preceding 12121 global/generic 12122 store/store atomic/ 12123 atomicrmw-no-return-value. 12124 - Must happen before 12125 the following 12126 store. 12127 - Ensures that all 12128 global memory 12129 operations have 12130 completed before 12131 performing the 12132 store that is being 12133 released. 12134 12135 2. ds_atomic 12136 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) & 12137 - system - generic vmcnt(0) & vscnt(0) 12138 12139 - If OpenCL, omit 12140 lgkmcnt(0). 
12141 - Could be split into 12142 separate s_waitcnt 12143 vmcnt(0), s_waitcnt 12144 vscnt(0) and s_waitcnt 12145 lgkmcnt(0) to allow 12146 them to be 12147 independently moved 12148 according to the 12149 following rules. 12150 - s_waitcnt vmcnt(0) 12151 must happen after 12152 any preceding 12153 global/generic 12154 load/load atomic/ 12155 atomicrmw-with-return-value. 12156 - s_waitcnt vscnt(0) 12157 must happen after 12158 any preceding 12159 global/generic 12160 store/store atomic/ 12161 atomicrmw-no-return-value. 12162 - s_waitcnt lgkmcnt(0) 12163 must happen after 12164 any preceding 12165 local/generic 12166 load/store/load 12167 atomic/store 12168 atomic/atomicrmw. 12169 - Must happen before 12170 the following 12171 atomicrmw. 12172 - Ensures that all 12173 memory operations 12174 to global and local 12175 have completed 12176 before performing 12177 the atomicrmw that 12178 is being released. 12179 12180 2. buffer/global/flat_atomic 12181 fence release - singlethread *none* *none* 12182 - wavefront 12183 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0) & 12184 vmcnt(0) & vscnt(0) 12185 12186 - If CU wavefront execution 12187 mode, omit vmcnt(0) and 12188 vscnt(0). 12189 - If OpenCL and 12190 address space is 12191 not generic, omit 12192 lgkmcnt(0). 12193 - If OpenCL and 12194 address space is 12195 local, omit 12196 vmcnt(0) and vscnt(0). 12197 - However, since LLVM 12198 currently has no 12199 address space on 12200 the fence need to 12201 conservatively 12202 always generate. If 12203 fence had an 12204 address space then 12205 set to address 12206 space of OpenCL 12207 fence flag, or to 12208 generic if both 12209 local and global 12210 flags are 12211 specified. 12212 - Could be split into 12213 separate s_waitcnt 12214 vmcnt(0), s_waitcnt 12215 vscnt(0) and s_waitcnt 12216 lgkmcnt(0) to allow 12217 them to be 12218 independently moved 12219 according to the 12220 following rules. 
12221 - s_waitcnt vmcnt(0) 12222 must happen after 12223 any preceding 12224 global/generic 12225 load/load 12226 atomic/ 12227 atomicrmw-with-return-value. 12228 - s_waitcnt vscnt(0) 12229 must happen after 12230 any preceding 12231 global/generic 12232 store/store atomic/ 12233 atomicrmw-no-return-value. 12234 - s_waitcnt lgkmcnt(0) 12235 must happen after 12236 any preceding 12237 local/generic 12238 load/store/load 12239 atomic/store atomic/ 12240 atomicrmw. 12241 - Must happen before 12242 any following store 12243 atomic/atomicrmw 12244 with an equal or 12245 wider sync scope 12246 and memory ordering 12247 stronger than 12248 unordered (this is 12249 termed the 12250 fence-paired-atomic). 12251 - Ensures that all 12252 memory operations 12253 have 12254 completed before 12255 performing the 12256 following 12257 fence-paired-atomic. 12258 12259 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) & 12260 - system vmcnt(0) & vscnt(0) 12261 12262 - If OpenCL and 12263 address space is 12264 not generic, omit 12265 lgkmcnt(0). 12266 - If OpenCL and 12267 address space is 12268 local, omit 12269 vmcnt(0) and vscnt(0). 12270 - However, since LLVM 12271 currently has no 12272 address space on 12273 the fence need to 12274 conservatively 12275 always generate. If 12276 fence had an 12277 address space then 12278 set to address 12279 space of OpenCL 12280 fence flag, or to 12281 generic if both 12282 local and global 12283 flags are 12284 specified. 12285 - Could be split into 12286 separate s_waitcnt 12287 vmcnt(0), s_waitcnt 12288 vscnt(0) and s_waitcnt 12289 lgkmcnt(0) to allow 12290 them to be 12291 independently moved 12292 according to the 12293 following rules. 12294 - s_waitcnt vmcnt(0) 12295 must happen after 12296 any preceding 12297 global/generic 12298 load/load atomic/ 12299 atomicrmw-with-return-value. 12300 - s_waitcnt vscnt(0) 12301 must happen after 12302 any preceding 12303 global/generic 12304 store/store atomic/ 12305 atomicrmw-no-return-value. 
12306 - s_waitcnt lgkmcnt(0) 12307 must happen after 12308 any preceding 12309 local/generic 12310 load/store/load 12311 atomic/store 12312 atomic/atomicrmw. 12313 - Must happen before 12314 any following store 12315 atomic/atomicrmw 12316 with an equal or 12317 wider sync scope 12318 and memory ordering 12319 stronger than 12320 unordered (this is 12321 termed the 12322 fence-paired-atomic). 12323 - Ensures that all 12324 memory operations 12325 have 12326 completed before 12327 performing the 12328 following 12329 fence-paired-atomic. 12330 12331 **Acquire-Release Atomic** 12332 ------------------------------------------------------------------------------------ 12333 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic 12334 - wavefront - local 12335 - generic 12336 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0) & 12337 vmcnt(0) & vscnt(0) 12338 12339 - If CU wavefront execution 12340 mode, omit vmcnt(0) and 12341 vscnt(0). 12342 - If OpenCL, omit 12343 lgkmcnt(0). 12344 - Must happen after 12345 any preceding 12346 local/generic 12347 load/store/load 12348 atomic/store 12349 atomic/atomicrmw. 12350 - Could be split into 12351 separate s_waitcnt 12352 vmcnt(0), s_waitcnt 12353 vscnt(0), and s_waitcnt 12354 lgkmcnt(0) to allow 12355 them to be 12356 independently moved 12357 according to the 12358 following rules. 12359 - s_waitcnt vmcnt(0) 12360 must happen after 12361 any preceding 12362 global/generic load/load 12363 atomic/ 12364 atomicrmw-with-return-value. 12365 - s_waitcnt vscnt(0) 12366 must happen after 12367 any preceding 12368 global/generic 12369 store/store 12370 atomic/ 12371 atomicrmw-no-return-value. 12372 - s_waitcnt lgkmcnt(0) 12373 must happen after 12374 any preceding 12375 local/generic 12376 load/store/load 12377 atomic/store 12378 atomic/atomicrmw. 12379 - Must happen before 12380 the following 12381 atomicrmw. 
12382 - Ensures that all 12383 memory operations 12384 have 12385 completed before 12386 performing the 12387 atomicrmw that is 12388 being released. 12389 12390 2. buffer/global_atomic 12391 3. s_waitcnt vm/vscnt(0) 12392 12393 - If CU wavefront execution 12394 mode, omit. 12395 - Use vmcnt(0) if atomic with 12396 return and vscnt(0) if 12397 atomic with no-return. 12398 - Must happen before 12399 the following 12400 buffer_gl0_inv. 12401 - Ensures any 12402 following global 12403 data read is no 12404 older than the 12405 atomicrmw value 12406 being acquired. 12407 12408 4. buffer_gl0_inv 12409 12410 - If CU wavefront execution 12411 mode, omit. 12412 - Ensures that 12413 following 12414 loads will not see 12415 stale data. 12416 12417 atomicrmw acq_rel - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0) 12418 12419 - If CU wavefront execution 12420 mode, omit. 12421 - If OpenCL, omit. 12422 - Could be split into 12423 separate s_waitcnt 12424 vmcnt(0) and s_waitcnt 12425 vscnt(0) to allow 12426 them to be 12427 independently moved 12428 according to the 12429 following rules. 12430 - s_waitcnt vmcnt(0) 12431 must happen after 12432 any preceding 12433 global/generic load/load 12434 atomic/ 12435 atomicrmw-with-return-value. 12436 - s_waitcnt vscnt(0) 12437 must happen after 12438 any preceding 12439 global/generic 12440 store/store atomic/ 12441 atomicrmw-no-return-value. 12442 - Must happen before 12443 the following 12444 store. 12445 - Ensures that all 12446 global memory 12447 operations have 12448 completed before 12449 performing the 12450 store that is being 12451 released. 12452 12453 2. ds_atomic 12454 3. s_waitcnt lgkmcnt(0) 12455 12456 - If OpenCL, omit. 12457 - Must happen before 12458 the following 12459 buffer_gl0_inv. 12460 - Ensures any 12461 following global 12462 data read is no 12463 older than the local 12464 atomicrmw value being 12465 acquired. 12466 12467 4. buffer_gl0_inv 12468 12469 - If CU wavefront execution 12470 mode, omit.
12471 - If OpenCL, omit. 12472 - Ensures that 12473 following 12474 loads will not see 12475 stale data. 12476 12477 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0) & 12478 vmcnt(0) & vscnt(0) 12479 12480 - If CU wavefront execution 12481 mode, omit vmcnt(0) and 12482 vscnt(0). 12483 - If OpenCL, omit lgkmcnt(0). 12484 - Could be split into 12485 separate s_waitcnt 12486 vmcnt(0), s_waitcnt 12487 vscnt(0) and s_waitcnt 12488 lgkmcnt(0) to allow 12489 them to be 12490 independently moved 12491 according to the 12492 following rules. 12493 - s_waitcnt vmcnt(0) 12494 must happen after 12495 any preceding 12496 global/generic load/load 12497 atomic/ 12498 atomicrmw-with-return-value. 12499 - s_waitcnt vscnt(0) 12500 must happen after 12501 any preceding 12502 global/generic 12503 store/store 12504 atomic/ 12505 atomicrmw-no-return-value. 12506 - s_waitcnt lgkmcnt(0) 12507 must happen after 12508 any preceding 12509 local/generic 12510 load/store/load 12511 atomic/store 12512 atomic/atomicrmw. 12513 - Must happen before 12514 the following 12515 atomicrmw. 12516 - Ensures that all 12517 memory operations 12518 have 12519 completed before 12520 performing the 12521 atomicrmw that is 12522 being released. 12523 12524 2. flat_atomic 12525 3. s_waitcnt lgkmcnt(0) & 12526 vmcnt(0) & vscnt(0) 12527 12528 - If CU wavefront execution 12529 mode, omit vmcnt(0) and 12530 vscnt(0). 12531 - If OpenCL, omit lgkmcnt(0). 12532 - Must happen before 12533 the following 12534 buffer_gl0_inv. 12535 - Ensures any 12536 following global 12537 data read is no 12538 older than the 12539 atomicrmw value being 12540 acquired. 12541 12542 4. buffer_gl0_inv 12543 12544 - If CU wavefront execution 12545 mode, omit. 12546 - Ensures that 12547 following 12548 loads will not see 12549 stale data. 12550 12551 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) & 12552 - system vmcnt(0) & vscnt(0) 12553 12554 - If OpenCL, omit 12555 lgkmcnt(0).
12556 - Could be split into 12557 separate s_waitcnt 12558 vmcnt(0), s_waitcnt 12559 vscnt(0) and s_waitcnt 12560 lgkmcnt(0) to allow 12561 them to be 12562 independently moved 12563 according to the 12564 following rules. 12565 - s_waitcnt vmcnt(0) 12566 must happen after 12567 any preceding 12568 global/generic 12569 load/load atomic/ 12570 atomicrmw-with-return-value. 12571 - s_waitcnt vscnt(0) 12572 must happen after 12573 any preceding 12574 global/generic 12575 store/store atomic/ 12576 atomicrmw-no-return-value. 12577 - s_waitcnt lgkmcnt(0) 12578 must happen after 12579 any preceding 12580 local/generic 12581 load/store/load 12582 atomic/store 12583 atomic/atomicrmw. 12584 - Must happen before 12585 the following 12586 atomicrmw. 12587 - Ensures that all 12588 memory operations 12589 to global have 12590 completed before 12591 performing the 12592 atomicrmw that is 12593 being released. 12594 12595 2. buffer/global_atomic 12596 3. s_waitcnt vm/vscnt(0) 12597 12598 - Use vmcnt(0) if atomic with 12599 return and vscnt(0) if 12600 atomic with no-return. 12601 - Must happen before 12602 following 12603 buffer_gl*_inv. 12604 - Ensures the 12605 atomicrmw has 12606 completed before 12607 invalidating the 12608 caches. 12609 12610 4. buffer_gl0_inv; 12611 buffer_gl1_inv 12612 12613 - Must happen before 12614 any following 12615 global/generic 12616 load/load 12617 atomic/atomicrmw. 12618 - Ensures that 12619 following loads 12620 will not see stale 12621 global data. 12622 12623 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) & 12624 - system vmcnt(0) & vscnt(0) 12625 12626 - If OpenCL, omit 12627 lgkmcnt(0). 12628 - Could be split into 12629 separate s_waitcnt 12630 vmcnt(0), s_waitcnt 12631 vscnt(0), and s_waitcnt 12632 lgkmcnt(0) to allow 12633 them to be 12634 independently moved 12635 according to the 12636 following rules. 
12637 - s_waitcnt vmcnt(0) 12638 must happen after 12639 any preceding 12640 global/generic 12641 load/load atomic 12642 atomicrmw-with-return-value. 12643 - s_waitcnt vscnt(0) 12644 must happen after 12645 any preceding 12646 global/generic 12647 store/store atomic/ 12648 atomicrmw-no-return-value. 12649 - s_waitcnt lgkmcnt(0) 12650 must happen after 12651 any preceding 12652 local/generic 12653 load/store/load 12654 atomic/store 12655 atomic/atomicrmw. 12656 - Must happen before 12657 the following 12658 atomicrmw. 12659 - Ensures that all 12660 memory operations 12661 have 12662 completed before 12663 performing the 12664 atomicrmw that is 12665 being released. 12666 12667 2. flat_atomic 12668 3. s_waitcnt vm/vscnt(0) & 12669 lgkmcnt(0) 12670 12671 - If OpenCL, omit 12672 lgkmcnt(0). 12673 - Use vmcnt(0) if atomic with 12674 return and vscnt(0) if 12675 atomic with no-return. 12676 - Must happen before 12677 following 12678 buffer_gl*_inv. 12679 - Ensures the 12680 atomicrmw has 12681 completed before 12682 invalidating the 12683 caches. 12684 12685 4. buffer_gl0_inv; 12686 buffer_gl1_inv 12687 12688 - Must happen before 12689 any following 12690 global/generic 12691 load/load 12692 atomic/atomicrmw. 12693 - Ensures that 12694 following loads 12695 will not see stale 12696 global data. 12697 12698 fence acq_rel - singlethread *none* *none* 12699 - wavefront 12700 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0) & 12701 vmcnt(0) & vscnt(0) 12702 12703 - If CU wavefront execution 12704 mode, omit vmcnt(0) and 12705 vscnt(0). 12706 - If OpenCL and 12707 address space is 12708 not generic, omit 12709 lgkmcnt(0). 12710 - If OpenCL and 12711 address space is 12712 local, omit 12713 vmcnt(0) and vscnt(0). 12714 - However, 12715 since LLVM 12716 currently has no 12717 address space on 12718 the fence need to 12719 conservatively 12720 always generate 12721 (see comment for 12722 previous fence). 
12723 - Could be split into 12724 separate s_waitcnt 12725 vmcnt(0), s_waitcnt 12726 vscnt(0) and s_waitcnt 12727 lgkmcnt(0) to allow 12728 them to be 12729 independently moved 12730 according to the 12731 following rules. 12732 - s_waitcnt vmcnt(0) 12733 must happen after 12734 any preceding 12735 global/generic 12736 load/load 12737 atomic/ 12738 atomicrmw-with-return-value. 12739 - s_waitcnt vscnt(0) 12740 must happen after 12741 any preceding 12742 global/generic 12743 store/store atomic/ 12744 atomicrmw-no-return-value. 12745 - s_waitcnt lgkmcnt(0) 12746 must happen after 12747 any preceding 12748 local/generic 12749 load/store/load 12750 atomic/store atomic/ 12751 atomicrmw. 12752 - Must happen before 12753 any following 12754 global/generic 12755 load/load 12756 atomic/store/store 12757 atomic/atomicrmw. 12758 - Ensures that all 12759 memory operations 12760 have 12761 completed before 12762 performing any 12763 following global 12764 memory operations. 12765 - Ensures that the 12766 preceding 12767 local/generic load 12768 atomic/atomicrmw 12769 with an equal or 12770 wider sync scope 12771 and memory ordering 12772 stronger than 12773 unordered (this is 12774 termed the 12775 acquire-fence-paired-atomic) 12776 has completed 12777 before following 12778 global memory 12779 operations. This 12780 satisfies the 12781 requirements of 12782 acquire. 12783 - Ensures that all 12784 previous memory 12785 operations have 12786 completed before a 12787 following 12788 local/generic store 12789 atomic/atomicrmw 12790 with an equal or 12791 wider sync scope 12792 and memory ordering 12793 stronger than 12794 unordered (this is 12795 termed the 12796 release-fence-paired-atomic). 12797 This satisfies the 12798 requirements of 12799 release. 12800 - Must happen before 12801 the following 12802 buffer_gl0_inv. 12803 - Ensures that the 12804 acquire-fence-paired 12805 atomic has completed 12806 before invalidating 12807 the 12808 cache. 
Therefore 12809 any following 12810 locations read must 12811 be no older than 12812 the value read by 12813 the 12814 acquire-fence-paired-atomic. 12815 12816 3. buffer_gl0_inv 12817 12818 - If CU wavefront execution 12819 mode, omit. 12820 - Ensures that 12821 following 12822 loads will not see 12823 stale data. 12824 12825 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) & 12826 - system vmcnt(0) & vscnt(0) 12827 12828 - If OpenCL and 12829 address space is 12830 not generic, omit 12831 lgkmcnt(0). 12832 - If OpenCL and 12833 address space is 12834 local, omit 12835 vmcnt(0) and vscnt(0). 12836 - However, since LLVM 12837 currently has no 12838 address space on 12839 the fence need to 12840 conservatively 12841 always generate 12842 (see comment for 12843 previous fence). 12844 - Could be split into 12845 separate s_waitcnt 12846 vmcnt(0), s_waitcnt 12847 vscnt(0) and s_waitcnt 12848 lgkmcnt(0) to allow 12849 them to be 12850 independently moved 12851 according to the 12852 following rules. 12853 - s_waitcnt vmcnt(0) 12854 must happen after 12855 any preceding 12856 global/generic 12857 load/load 12858 atomic/ 12859 atomicrmw-with-return-value. 12860 - s_waitcnt vscnt(0) 12861 must happen after 12862 any preceding 12863 global/generic 12864 store/store atomic/ 12865 atomicrmw-no-return-value. 12866 - s_waitcnt lgkmcnt(0) 12867 must happen after 12868 any preceding 12869 local/generic 12870 load/store/load 12871 atomic/store 12872 atomic/atomicrmw. 12873 - Must happen before 12874 the following 12875 buffer_gl*_inv. 12876 - Ensures that the 12877 preceding 12878 global/local/generic 12879 load 12880 atomic/atomicrmw 12881 with an equal or 12882 wider sync scope 12883 and memory ordering 12884 stronger than 12885 unordered (this is 12886 termed the 12887 acquire-fence-paired-atomic) 12888 has completed 12889 before invalidating 12890 the caches. This 12891 satisfies the 12892 requirements of 12893 acquire. 
12894 - Ensures that all 12895 previous memory 12896 operations have 12897 completed before a 12898 following 12899 global/local/generic 12900 store 12901 atomic/atomicrmw 12902 with an equal or 12903 wider sync scope 12904 and memory ordering 12905 stronger than 12906 unordered (this is 12907 termed the 12908 release-fence-paired-atomic). 12909 This satisfies the 12910 requirements of 12911 release. 12912 12913 2. buffer_gl0_inv; 12914 buffer_gl1_inv 12915 12916 - Must happen before 12917 any following 12918 global/generic 12919 load/load 12920 atomic/store/store 12921 atomic/atomicrmw. 12922 - Ensures that 12923 following loads 12924 will not see stale 12925 global data. This 12926 satisfies the 12927 requirements of 12928 acquire. 12929 12930 **Sequential Consistent Atomic** 12931 ------------------------------------------------------------------------------------ 12932 load atomic seq_cst - singlethread - global *Same as corresponding 12933 - wavefront - local load atomic acquire, 12934 - generic except must generate 12935 all instructions even 12936 for OpenCL.* 12937 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0) & 12938 - generic vmcnt(0) & vscnt(0) 12939 12940 - If CU wavefront execution 12941 mode, omit vmcnt(0) and 12942 vscnt(0). 12943 - Could be split into 12944 separate s_waitcnt 12945 vmcnt(0), s_waitcnt 12946 vscnt(0), and s_waitcnt 12947 lgkmcnt(0) to allow 12948 them to be 12949 independently moved 12950 according to the 12951 following rules. 12952 - s_waitcnt lgkmcnt(0) must 12953 happen after 12954 preceding 12955 local/generic load 12956 atomic/store 12957 atomic/atomicrmw 12958 with memory 12959 ordering of seq_cst 12960 and with equal or 12961 wider sync scope. 12962 (Note that seq_cst 12963 fences have their 12964 own s_waitcnt 12965 lgkmcnt(0) and so do 12966 not need to be 12967 considered.) 
12968 - s_waitcnt vmcnt(0) 12969 must happen after 12970 preceding 12971 global/generic load 12972 atomic/ 12973 atomicrmw-with-return-value 12974 with memory 12975 ordering of seq_cst 12976 and with equal or 12977 wider sync scope. 12978 (Note that seq_cst 12979 fences have their 12980 own s_waitcnt 12981 vmcnt(0) and so do 12982 not need to be 12983 considered.) 12984 - s_waitcnt vscnt(0) 12985 Must happen after 12986 preceding 12987 global/generic store 12988 atomic/ 12989 atomicrmw-no-return-value 12990 with memory 12991 ordering of seq_cst 12992 and with equal or 12993 wider sync scope. 12994 (Note that seq_cst 12995 fences have their 12996 own s_waitcnt 12997 vscnt(0) and so do 12998 not need to be 12999 considered.) 13000 - Ensures any 13001 preceding 13002 sequential 13003 consistent global/local 13004 memory instructions 13005 have completed 13006 before executing 13007 this sequentially 13008 consistent 13009 instruction. This 13010 prevents reordering 13011 a seq_cst store 13012 followed by a 13013 seq_cst load. (Note 13014 that seq_cst is 13015 stronger than 13016 acquire/release as 13017 the reordering of 13018 load acquire 13019 followed by a store 13020 release is 13021 prevented by the 13022 s_waitcnt of 13023 the release, but 13024 there is nothing 13025 preventing a store 13026 release followed by 13027 load acquire from 13028 completing out of 13029 order. The s_waitcnt 13030 could be placed after 13031 seq_store or before 13032 the seq_load. We 13033 choose the load to 13034 make the s_waitcnt be 13035 as late as possible 13036 so that the store 13037 may have already 13038 completed.) 13039 13040 2. *Following 13041 instructions same as 13042 corresponding load 13043 atomic acquire, 13044 except must generate 13045 all instructions even 13046 for OpenCL.* 13047 load atomic seq_cst - workgroup - local 13048 13049 1. s_waitcnt vmcnt(0) & vscnt(0) 13050 13051 - If CU wavefront execution 13052 mode, omit. 
13053 - Could be split into 13054 separate s_waitcnt 13055 vmcnt(0) and s_waitcnt 13056 vscnt(0) to allow 13057 them to be 13058 independently moved 13059 according to the 13060 following rules. 13061 - s_waitcnt vmcnt(0) 13062 Must happen after 13063 preceding 13064 global/generic load 13065 atomic/ 13066 atomicrmw-with-return-value 13067 with memory 13068 ordering of seq_cst 13069 and with equal or 13070 wider sync scope. 13071 (Note that seq_cst 13072 fences have their 13073 own s_waitcnt 13074 vmcnt(0) and so do 13075 not need to be 13076 considered.) 13077 - s_waitcnt vscnt(0) 13078 Must happen after 13079 preceding 13080 global/generic store 13081 atomic/ 13082 atomicrmw-no-return-value 13083 with memory 13084 ordering of seq_cst 13085 and with equal or 13086 wider sync scope. 13087 (Note that seq_cst 13088 fences have their 13089 own s_waitcnt 13090 vscnt(0) and so do 13091 not need to be 13092 considered.) 13093 - Ensures any 13094 preceding 13095 sequential 13096 consistent global 13097 memory instructions 13098 have completed 13099 before executing 13100 this sequentially 13101 consistent 13102 instruction. This 13103 prevents reordering 13104 a seq_cst store 13105 followed by a 13106 seq_cst load. (Note 13107 that seq_cst is 13108 stronger than 13109 acquire/release as 13110 the reordering of 13111 load acquire 13112 followed by a store 13113 release is 13114 prevented by the 13115 s_waitcnt of 13116 the release, but 13117 there is nothing 13118 preventing a store 13119 release followed by 13120 load acquire from 13121 completing out of 13122 order. The s_waitcnt 13123 could be placed after 13124 seq_store or before 13125 the seq_load. We 13126 choose the load to 13127 make the s_waitcnt be 13128 as late as possible 13129 so that the store 13130 may have already 13131 completed.) 13132 13133 2. 
*Following 13134 instructions same as 13135 corresponding load 13136 atomic acquire, 13137 except must generate 13138 all instructions even 13139 for OpenCL.* 13140 13141 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) & 13142 - system - generic vmcnt(0) & vscnt(0) 13143 13144 - Could be split into 13145 separate s_waitcnt 13146 vmcnt(0), s_waitcnt 13147 vscnt(0) and s_waitcnt 13148 lgkmcnt(0) to allow 13149 them to be 13150 independently moved 13151 according to the 13152 following rules. 13153 - s_waitcnt lgkmcnt(0) 13154 must happen after 13155 preceding 13156 local load 13157 atomic/store 13158 atomic/atomicrmw 13159 with memory 13160 ordering of seq_cst 13161 and with equal or 13162 wider sync scope. 13163 (Note that seq_cst 13164 fences have their 13165 own s_waitcnt 13166 lgkmcnt(0) and so do 13167 not need to be 13168 considered.) 13169 - s_waitcnt vmcnt(0) 13170 must happen after 13171 preceding 13172 global/generic load 13173 atomic/ 13174 atomicrmw-with-return-value 13175 with memory 13176 ordering of seq_cst 13177 and with equal or 13178 wider sync scope. 13179 (Note that seq_cst 13180 fences have their 13181 own s_waitcnt 13182 vmcnt(0) and so do 13183 not need to be 13184 considered.) 13185 - s_waitcnt vscnt(0) 13186 Must happen after 13187 preceding 13188 global/generic store 13189 atomic/ 13190 atomicrmw-no-return-value 13191 with memory 13192 ordering of seq_cst 13193 and with equal or 13194 wider sync scope. 13195 (Note that seq_cst 13196 fences have their 13197 own s_waitcnt 13198 vscnt(0) and so do 13199 not need to be 13200 considered.) 13201 - Ensures any 13202 preceding 13203 sequential 13204 consistent global 13205 memory instructions 13206 have completed 13207 before executing 13208 this sequentially 13209 consistent 13210 instruction. This 13211 prevents reordering 13212 a seq_cst store 13213 followed by a 13214 seq_cst load. 
(Note 13215 that seq_cst is 13216 stronger than 13217 acquire/release as 13218 the reordering of 13219 load acquire 13220 followed by a store 13221 release is 13222 prevented by the 13223 s_waitcnt of 13224 the release, but 13225 there is nothing 13226 preventing a store 13227 release followed by 13228 load acquire from 13229 completing out of 13230 order. The s_waitcnt 13231 could be placed after 13232 seq_store or before 13233 the seq_load. We 13234 choose the load to 13235 make the s_waitcnt be 13236 as late as possible 13237 so that the store 13238 may have already 13239 completed.) 13240 13241 2. *Following 13242 instructions same as 13243 corresponding load 13244 atomic acquire, 13245 except must generate 13246 all instructions even 13247 for OpenCL.* 13248 store atomic seq_cst - singlethread - global *Same as corresponding 13249 - wavefront - local store atomic release, 13250 - workgroup - generic except must generate 13251 - agent all instructions even 13252 - system for OpenCL.* 13253 atomicrmw seq_cst - singlethread - global *Same as corresponding 13254 - wavefront - local atomicrmw acq_rel, 13255 - workgroup - generic except must generate 13256 - agent all instructions even 13257 - system for OpenCL.* 13258 fence seq_cst - singlethread *none* *Same as corresponding 13259 - wavefront fence acq_rel, 13260 - workgroup except must generate 13261 - agent all instructions even 13262 - system for OpenCL.* 13263 ============ ============ ============== ========== ================================ 13264 13265.. _amdgpu-amdhsa-trap-handler-abi: 13266 13267Trap Handler ABI 13268~~~~~~~~~~~~~~~~ 13269 13270For code objects generated by the AMDGPU backend for HSA [HSA]_ compatible 13271runtimes (see :ref:`amdgpu-os`), the runtime installs a trap handler that 13272supports the ``s_trap`` instruction. 
For usage see:

- :ref:`amdgpu-trap-handler-for-amdhsa-os-v2-table`
- :ref:`amdgpu-trap-handler-for-amdhsa-os-v3-table`
- :ref:`amdgpu-trap-handler-for-amdhsa-os-v4-onwards-table`

  .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V2
     :name: amdgpu-trap-handler-for-amdhsa-os-v2-table

     =================== =============== =============== =======================================
     Usage               Code Sequence   Trap Handler    Description
                                         Inputs
     =================== =============== =============== =======================================
     reserved            ``s_trap 0x00``                 Reserved by hardware.
     ``debugtrap(arg)``  ``s_trap 0x01`` ``SGPR0-1``:    Reserved for Finalizer HSA ``debugtrap``
                                         ``queue_ptr``   intrinsic (not implemented).
                                         ``VGPR0``:
                                         ``arg``
     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes wave to be halted with the PC at
                                         ``queue_ptr``   the trap instruction. The associated
                                                         queue is signalled to put it into the
                                                         error state. When the queue is put in
                                                         the error state, the waves executing
                                                         dispatches on the queue will be
                                                         terminated.
     ``llvm.debugtrap``  ``s_trap 0x03`` *none*          - If debugger not enabled then behaves
                                                           as a no-operation. The trap handler
                                                           is entered and immediately returns to
                                                           continue execution of the wavefront.
                                                         - If the debugger is enabled, causes
                                                           the debug trap to be reported by the
                                                           debugger and the wavefront is put in
                                                           the halt state with the PC at the
                                                           instruction. The debugger must
                                                           increment the PC and resume the wave.
     reserved            ``s_trap 0x04``                 Reserved.
     reserved            ``s_trap 0x05``                 Reserved.
     reserved            ``s_trap 0x06``                 Reserved.
     reserved            ``s_trap 0x07``                 Reserved.
     reserved            ``s_trap 0x08``                 Reserved.
     reserved            ``s_trap 0xfe``                 Reserved.
     reserved            ``s_trap 0xff``                 Reserved.
     =================== =============== =============== =======================================

..

  .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V3
     :name: amdgpu-trap-handler-for-amdhsa-os-v3-table

     =================== =============== =============== =======================================
     Usage               Code Sequence   Trap Handler    Description
                                         Inputs
     =================== =============== =============== =======================================
     reserved            ``s_trap 0x00``                 Reserved by hardware.
     debugger breakpoint ``s_trap 0x01`` *none*          Reserved for debugger to use for
                                                         breakpoints. Causes wave to be halted
                                                         with the PC at the trap instruction.
                                                         The debugger is responsible to resume
                                                         the wave, including the instruction
                                                         that the breakpoint overwrote.
     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes wave to be halted with the PC at
                                         ``queue_ptr``   the trap instruction. The associated
                                                         queue is signalled to put it into the
                                                         error state. When the queue is put in
                                                         the error state, the waves executing
                                                         dispatches on the queue will be
                                                         terminated.
     ``llvm.debugtrap``  ``s_trap 0x03`` *none*          - If debugger not enabled then behaves
                                                           as a no-operation. The trap handler
                                                           is entered and immediately returns to
                                                           continue execution of the wavefront.
                                                         - If the debugger is enabled, causes
                                                           the debug trap to be reported by the
                                                           debugger and the wavefront is put in
                                                           the halt state with the PC at the
                                                           instruction. The debugger must
                                                           increment the PC and resume the wave.
     reserved            ``s_trap 0x04``                 Reserved.
     reserved            ``s_trap 0x05``                 Reserved.
     reserved            ``s_trap 0x06``                 Reserved.
     reserved            ``s_trap 0x07``                 Reserved.
     reserved            ``s_trap 0x08``                 Reserved.
     reserved            ``s_trap 0xfe``                 Reserved.
     reserved            ``s_trap 0xff``                 Reserved.
     =================== =============== =============== =======================================

..

  .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V4 and Above
     :name: amdgpu-trap-handler-for-amdhsa-os-v4-onwards-table

     =================== =============== ================ ================= =======================================
     Usage               Code Sequence   GFX6-GFX8 Inputs GFX9-GFX11 Inputs Description
     =================== =============== ================ ================= =======================================
     reserved            ``s_trap 0x00``                                    Reserved by hardware.
     debugger breakpoint ``s_trap 0x01`` *none*           *none*            Reserved for debugger to use for
                                                                            breakpoints. Causes wave to be halted
                                                                            with the PC at the trap instruction.
                                                                            The debugger is responsible to resume
                                                                            the wave, including the instruction
                                                                            that the breakpoint overwrote.
     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:     *none*            Causes wave to be halted with the PC at
                                         ``queue_ptr``                      the trap instruction. The associated
                                                                            queue is signalled to put it into the
                                                                            error state. When the queue is put in
                                                                            the error state, the waves executing
                                                                            dispatches on the queue will be
                                                                            terminated.
     ``llvm.debugtrap``  ``s_trap 0x03`` *none*           *none*            - If debugger not enabled then behaves
                                                                              as a no-operation. The trap handler
                                                                              is entered and immediately returns to
                                                                              continue execution of the wavefront.
                                                                            - If the debugger is enabled, causes
                                                                              the debug trap to be reported by the
                                                                              debugger and the wavefront is put in
                                                                              the halt state with the PC at the
                                                                              instruction. The debugger must
                                                                              increment the PC and resume the wave.
     reserved            ``s_trap 0x04``                                    Reserved.
     reserved            ``s_trap 0x05``                                    Reserved.
     reserved            ``s_trap 0x06``                                    Reserved.
     reserved            ``s_trap 0x07``                                    Reserved.
     reserved            ``s_trap 0x08``                                    Reserved.
     reserved            ``s_trap 0xfe``                                    Reserved.
     reserved            ``s_trap 0xff``                                    Reserved.
     =================== =============== ================ ================= =======================================

.. _amdgpu-amdhsa-function-call-convention:

Call Convention
~~~~~~~~~~~~~~~

.. note::

  This section is currently incomplete and has inaccuracies. It is a work in
  progress and will be updated as information is determined.

See :ref:`amdgpu-dwarf-address-space-identifier` for information on swizzled
addresses. Unswizzled addresses are normal linear addresses.

.. _amdgpu-amdhsa-function-call-convention-kernel-functions:

Kernel Functions
++++++++++++++++

This section describes the call convention ABI for the outer kernel function.

See :ref:`amdgpu-amdhsa-initial-kernel-execution-state` for the kernel call
convention.

The following is not part of the AMDGPU kernel calling convention but describes
how the AMDGPU implements function calls:

1. Clang decides the kernarg layout to match the *HSA Programmer's Language
   Reference* [HSA]_.

   - All structs are passed directly.
   - Lambda values are passed *TBA*.

   .. TODO::

      - Does this really follow HSA rules? Or are structs >16 bytes passed as
        by-value structs?
      - What is the ABI for lambda values?

2. The kernel performs certain setup in its prolog, as described in
   :ref:`amdgpu-amdhsa-kernel-prolog`.

.. _amdgpu-amdhsa-function-call-convention-non-kernel-functions:

Non-Kernel Functions
++++++++++++++++++++

This section describes the call convention ABI for functions other than the
outer kernel function.

If a kernel has function calls, then scratch is always allocated and used for
the call stack, which grows from low address to high address using the swizzled
scratch address space.
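To make the swizzled scratch arithmetic concrete, the following is a small
illustrative model. It is not part of the ABI: the wave size of 64 and the
helper names are assumptions made for this sketch, which only restates the
dword element size, the wavefront-size stride, and the
``swizzled SP = unswizzled SP / wavefront size`` conversion described in this
section.

```python
WAVEFRONT_SIZE = 64  # assumption for this sketch; GFX10 also supports wave32
DWORD = 4            # scratch swizzle element size in bytes


def swizzled_sp(unswizzled_sp):
    # The conversion given in this section:
    #   swizzled SP = unswizzled SP / wavefront size
    return unswizzled_sp // WAVEFRONT_SIZE


def lane_stack_dword_address(unswizzled_offset, lane):
    # Byte offset, within the wavefront scratch backing memory, of one
    # lane's stack dword. Scratch is swizzled with dword element size and
    # a stride of wavefront-size elements, so consecutive dwords of a
    # single lane are WAVEFRONT_SIZE dwords apart in the backing memory.
    assert unswizzled_offset % (DWORD * WAVEFRONT_SIZE) == 0
    assert 0 <= lane < WAVEFRONT_SIZE
    element = unswizzled_offset // (DWORD * WAVEFRONT_SIZE)  # per-lane dword index
    return (element * WAVEFRONT_SIZE + lane) * DWORD
```

For example, with wave64 an unswizzled SP of 256 corresponds to a swizzled SP
of 4, i.e. the second stack dword of each lane.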

On entry to a function:

1. SGPR0-3 contain a V# with the following properties (see
   :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`):

   * Base address pointing to the beginning of the wavefront scratch backing
     memory.
   * Swizzled with dword element size and stride of wavefront size elements.

2. The FLAT_SCRATCH register pair is set up. See
   :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
3. GFX6-GFX8: M0 register set to the size of LDS in bytes. See
   :ref:`amdgpu-amdhsa-kernel-prolog-m0`.
4. The EXEC register is set to the lanes active on entry to the function.
5. MODE register: *TBD*
6. VGPR0-31 and SGPR4-29 are used to pass function input arguments as described
   below.
7. SGPR30-31 return address (RA). The code address that the function must
   return to when it completes. The value is undefined if the function is
   *no return*.
8. SGPR32 is used for the stack pointer (SP). It is an unswizzled scratch
   offset relative to the beginning of the wavefront scratch backing memory.

   The unswizzled SP can be used with buffer instructions as an unswizzled SGPR
   offset with the scratch V# in SGPR0-3 to access the stack in a swizzled
   manner.

   The unswizzled SP value can be converted into the swizzled SP value by:

   | swizzled SP = unswizzled SP / wavefront size

   This may be used to obtain the private address space address of stack
   objects and to convert this address to a flat address by adding the flat
   scratch aperture base address.

   The swizzled SP value is always 4-byte aligned for the ``r600``
   architecture and 16-byte aligned for the ``amdgcn`` architecture.

   .. note::

      The ``amdgcn`` value is selected to avoid dynamic stack alignment for
      the OpenCL language, which has the largest base type defined as 16
      bytes.

   On entry, the swizzled SP value is the address of the first function
   argument passed on the stack. Other stack-passed arguments are positive
   offsets from the entry swizzled SP value.

   The function may use positive offsets beyond the last stack-passed argument
   for stack-allocated local variables and register spill slots. If necessary,
   the function may align these to a greater alignment than 16 bytes. After
   these, the function may dynamically allocate space for such things as
   runtime-sized ``alloca`` local allocations.

   If the function calls another function, it will place any stack-allocated
   arguments after the last local allocation and adjust SGPR32 to the address
   after the last local allocation.

9. All other registers are unspecified.
10. Any necessary ``s_waitcnt`` has been performed to ensure memory is
    available to the function.

On exit from a function:

1. VGPR0-31 and SGPR4-29 are used to pass function result arguments as
   described below. Any registers used are considered clobbered registers.
2. The following registers are preserved and have the same value as on entry:

   * FLAT_SCRATCH
   * EXEC
   * GFX6-GFX8: M0
   * All SGPR registers except the clobbered registers of SGPR4-31.
   * VGPR40-47
   * VGPR56-63
   * VGPR72-79
   * VGPR88-95
   * VGPR104-111
   * VGPR120-127
   * VGPR136-143
   * VGPR152-159
   * VGPR168-175
   * VGPR184-191
   * VGPR200-207
   * VGPR216-223
   * VGPR232-239
   * VGPR248-255

   .. note::

      Except for the argument registers, the clobbered VGPRs and the preserved
      registers are intermixed at regular intervals in order to keep a similar
      ratio independent of the number of allocated VGPRs.

   * GFX90A: All AGPR registers except the clobbered registers AGPR0-31.
   * Lanes of all VGPRs that are inactive at the call site.

   For the AMDGPU backend, an inter-procedural register allocation (IPRA)
   optimization may mark some of the clobbered SGPR and VGPR registers as
   preserved if it can be determined that the called function does not change
   their value.

3. The PC is set to the RA provided on entry.
4. MODE register: *TBD*.
5. All other registers are clobbered.
6. Any necessary ``s_waitcnt`` has been performed to ensure memory accessed by
   the function is available to the caller.

.. TODO::

   - How are function results returned? The address of structured types is
     passed by reference, but what about other types?

The function input arguments are made up of the formal arguments explicitly
declared by the source language function plus the implicit input arguments used
by the implementation.

The source language input arguments are:

1. Any source language implicit ``this`` or ``self`` argument comes first as a
   pointer type.
2. Followed by the function formal arguments in left-to-right source order.

The source language result arguments are:

1. The function result argument.

The source language input or result struct type arguments that are less than or
equal to 16 bytes are decomposed recursively into their base type fields, and
each field is passed as if a separate argument. For input arguments, if the
called function requires the struct to be in memory, for example because its
address is taken, then the function body is responsible for allocating a stack
location and copying the field arguments into it. Clang terms this *direct
struct*.

The source language input struct type arguments that are greater than 16 bytes
are passed by reference.
The caller is responsible for allocating a stack
location to make a copy of the struct value and pass the address as the input
argument. The called function is responsible for performing the dereference
when accessing the input argument. Clang terms this *by-value struct*.

A source language result struct type argument that is greater than 16 bytes is
returned by reference. The caller is responsible for allocating a stack
location to hold the result value and passes the address as the last input
argument (before the implicit input arguments). In this case there are no
result arguments. The called function is responsible for performing the
dereference when storing the result value. Clang terms this *structured return
(sret)*.

*TODO: correct the* ``sret`` *definition.*

.. TODO::

   Is this definition correct? Or is ``sret`` only used if passing in
   registers, and pass as non-decomposed struct as stack argument? Or something
   else? Is the memory location in the caller stack frame, or a stack memory
   argument and so no address is passed as the caller can directly write to the
   argument stack location? But then the stack location is still live after
   return. If an argument stack location is used, is it the first stack
   argument or the last one?

Lambda argument types are treated as struct types with an
implementation-defined set of fields.

.. TODO::

   Need to specify the ABI for lambda types for AMDGPU.

For the AMDGPU backend, all source language arguments (including the decomposed
struct type arguments) are passed in VGPRs unless marked ``inreg``, in which
case they are passed in SGPRs.

The AMDGPU backend walks the function call graph from the leaves to determine
which implicit input arguments are used, propagating to each caller of the
function. The used implicit arguments are appended to the function arguments
after the source language arguments in the following order:

.. TODO::

   Are recursion or external functions supported?

1. Work-Item ID (1 VGPR)

   The X, Y and Z work-item IDs are packed into a single VGPR with the
   following layout. Only fields actually used by the function are set. The
   other bits are undefined.

   The values come from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.

   .. table:: Work-item implicit argument layout
      :name: amdgpu-amdhsa-workitem-implicit-argument-layout-table

      ======= ======= ==============
      Bits    Size    Field Name
      ======= ======= ==============
      9:0     10 bits X Work-Item ID
      19:10   10 bits Y Work-Item ID
      29:20   10 bits Z Work-Item ID
      31:30   2 bits  Unused
      ======= ======= ==============

2. Dispatch Ptr (2 SGPRs)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

3. Queue Ptr (2 SGPRs)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

4. Kernarg Segment Ptr (2 SGPRs)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

5. Dispatch id (2 SGPRs)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

6. Work-Group ID X (1 SGPR)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

7. Work-Group ID Y (1 SGPR)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13679 136808. Work-Group ID Z (1 SGPR) 13681 13682 The value comes from the initial kernel execution state. See 13683 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`. 13684 136859. Implicit Argument Ptr (2 SGPRs) 13686 13687 The value is computed by adding an offset to Kernarg Segment Ptr to get the 13688 global address space pointer to the first kernarg implicit argument. 13689 13690The input and result arguments are assigned in order in the following manner: 13691 13692.. note:: 13693 13694 There are likely some errors and omissions in the following description that 13695 need correction. 13696 13697 .. TODO:: 13698 13699 Check the Clang source code to decipher how function arguments and return 13700 results are handled. Also see the AMDGPU specific values used. 13701 13702* VGPR arguments are assigned to consecutive VGPRs starting at VGPR0 up to 13703 VGPR31. 13704 13705 If there are more arguments than will fit in these registers, the remaining 13706 arguments are allocated on the stack in order on naturally aligned 13707 addresses. 13708 13709 .. TODO:: 13710 13711 How are overly aligned structures allocated on the stack? 13712 13713* SGPR arguments are assigned to consecutive SGPRs starting at SGPR0 up to 13714 SGPR29. 13715 13716 If there are more arguments than will fit in these registers, the remaining 13717 arguments are allocated on the stack in order on naturally aligned 13718 addresses. 13719 13720Note that decomposed struct type arguments may have some fields passed in 13721registers and some in memory. 13722 13723.. TODO:: 13724 13725 So, a struct which can pass some fields as decomposed register arguments, will 13726 pass the rest as decomposed stack elements? But an argument that will not start 13727 in registers will not be decomposed and will be passed as a non-decomposed 13728 stack value? 
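As an illustration of the work-item ID packing described in
:ref:`amdgpu-amdhsa-workitem-implicit-argument-layout-table`, the following
minimal sketch shows how a runtime or debugger might unpack the three IDs
from the single 32-bit value (Python is used purely for illustration and is
not part of the ABI; the field positions are taken from the layout table):

.. code-block:: python

   def unpack_workitem_id(vgpr: int) -> tuple[int, int, int]:
       """Extract the X, Y and Z work-item IDs from the packed 32-bit value.

       Bits 9:0 hold X, bits 19:10 hold Y, bits 29:20 hold Z; bits 31:30 are
       unused and must be ignored.
       """
       x = vgpr & 0x3FF
       y = (vgpr >> 10) & 0x3FF
       z = (vgpr >> 20) & 0x3FF
       return x, y, z

   # Pack X=5, Y=3, Z=1 the same way the hardware does, then recover them.
   packed = 5 | (3 << 10) | (1 << 20)
   assert unpack_workitem_id(packed) == (5, 3, 1)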
The following is not part of the AMDGPU function calling convention but
describes how the AMDGPU implements function calls:

1. SGPR33 is used as a frame pointer (FP) if necessary. Like the SP it is an
   unswizzled scratch address. It is only needed if runtime sized ``alloca``
   are used, or for the reasons defined in ``SIFrameLowering``.
2. Runtime stack alignment is supported. SGPR34 is used as a base pointer (BP)
   to access the incoming stack arguments in the function. The BP is needed
   only when the function requires the runtime stack alignment.

3. Allocating SGPR arguments on the stack is not supported.

4. No CFI is currently generated. See
   :ref:`amdgpu-dwarf-call-frame-information`.

   .. note::

      CFI will be generated that defines the CFA as the unswizzled address
      relative to the wave scratch base in the unswizzled private address space
      of the lowest address stack allocated local variable.

      ``DW_AT_frame_base`` will be defined as the swizzled address in the
      swizzled private address space by dividing the CFA by the wavefront size
      (since the CFA is always at least dword aligned, which matches the
      scratch swizzle element size).

      If no dynamic stack alignment was performed, the stack allocated
      arguments are accessed as negative offsets relative to
      ``DW_AT_frame_base``, and the local variables and register spill slots
      are accessed as positive offsets relative to ``DW_AT_frame_base``.

5. Function argument passing is implemented by copying the input physical
   registers to virtual registers on entry. The register allocator can spill if
   necessary. These are copied back to physical registers at call sites. The
   net effect is that each function call can have these values in entirely
   distinct locations. The IPRA can help avoid shuffling argument registers.
6. Call sites are implemented by setting up the arguments at positive offsets
   from SP. Then SP is incremented to account for the known frame size before
   the call and decremented after the call.

   .. note::

      The CFI will reflect the changed calculation needed to compute the CFA
      from SP.

7. 4 byte spill slots are used in the stack frame. One slot is allocated for an
   emergency spill slot. Buffer instructions are used for stack accesses and
   not the ``flat_scratch`` instruction.

   .. TODO::

      Explain when the emergency spill slot is used.

.. TODO::

   Possible broken issues:

   - Stack arguments must be aligned to required alignment.
   - Stack is aligned to max(16, max formal argument alignment).
   - Direct argument < 64 bits should check register budget.
   - Register budget calculation should respect ``inreg`` for SGPR.
   - SGPR overflow is not handled.
   - struct with 1 member unpeeling is not checking size of member.
   - ``sret`` is after ``this`` pointer.
   - Caller is not implementing stack realignment: need an extra pointer.
   - Should say AMDGPU passes FP rather than SP.
   - Should CFI define CFA as the address of locals or arguments? The
     difference becomes apparent once dynamic alignment is implemented.
   - If the ``SCRATCH`` instruction could allow negative offsets, then FP could
     be made the highest address of the stack frame and negative offsets used
     for locals. This would allow SP to be the same as FP and could support
     signal-handler-like usage, as there would be a real SP for the top of the
     stack.
   - How is ``sret`` passed on the stack? In argument stack area? Can it
     overlay arguments?

AMDPAL
------

This section provides code conventions used when the target triple OS is
``amdpal`` (see :ref:`amdgpu-target-triples`).

.. _amdgpu-amdpal-code-object-metadata-section:

Code Object Metadata
~~~~~~~~~~~~~~~~~~~~

.. note::

   The metadata is currently in development and is subject to major
   changes. Only the current version is supported. *When this document
   was generated the version was 2.6.*

Code object metadata is specified by the ``NT_AMDGPU_METADATA`` note
record (see :ref:`amdgpu-note-records-v3-onwards`).

The metadata is represented as Message Pack formatted binary data (see
[MsgPack]_). The top level is a Message Pack map that includes the keys
defined in table :ref:`amdgpu-amdpal-code-object-metadata-map-table`
and referenced tables.

Additional information can be added to the maps. To avoid conflicts, any
key names should be prefixed by "*vendor-name*." where ``vendor-name``
can be the name of the vendor and specific vendor tool that generates the
information. The prefix is abbreviated to simply "." when it appears
within a map that has been added by the same *vendor-name*.

  .. table:: AMDPAL Code Object Metadata Map
     :name: amdgpu-amdpal-code-object-metadata-map-table

     =================== ============== ========= ======================================================================
     String Key          Value Type     Required? Description
     =================== ============== ========= ======================================================================
     "amdpal.version"    sequence of    Required  PAL code object metadata (major, minor) version. The current values
                         2 integers               are defined by *Util::Abi::PipelineMetadata(Major|Minor)Version*.
     "amdpal.pipelines"  sequence of    Required  Per-pipeline metadata. See
                         map                      :ref:`amdgpu-amdpal-code-object-pipeline-metadata-map-table` for the
                                                  definition of the keys included in that map.
     =================== ============== ========= ======================================================================

..

  .. table:: AMDPAL Code Object Pipeline Metadata Map
     :name: amdgpu-amdpal-code-object-pipeline-metadata-map-table

     ====================================== ============== ========= ===================================================
     String Key                             Value Type     Required? Description
     ====================================== ============== ========= ===================================================
     ".name"                                string                   Source name of the pipeline.
     ".type"                                string                   Pipeline type, e.g. VsPs. Values include:

                                                                     - "VsPs"
                                                                     - "Gs"
                                                                     - "Cs"
                                                                     - "Ngg"
                                                                     - "Tess"
                                                                     - "GsTess"
                                                                     - "NggTess"

     ".internal_pipeline_hash"              sequence of    Required  Internal compiler hash for this pipeline. Lower
                                            2 integers               64 bits is the "stable" portion of the hash, used
                                                                     for e.g. shader replacement lookup. Upper 64 bits
                                                                     is the "unique" portion of the hash, used for
                                                                     e.g. pipeline cache lookup. The value is
                                                                     implementation defined, and can not be relied on
                                                                     between different builds of the compiler.
     ".shaders"                             map                      Per-API shader metadata. See
                                                                     :ref:`amdgpu-amdpal-code-object-shader-map-table`
                                                                     for the definition of the keys included in that
                                                                     map.
     ".hardware_stages"                     map                      Per-hardware stage metadata. See
                                                                     :ref:`amdgpu-amdpal-code-object-hardware-stage-map-table`
                                                                     for the definition of the keys included in that
                                                                     map.
     ".shader_functions"                    map                      Per-shader function metadata. See
                                                                     :ref:`amdgpu-amdpal-code-object-shader-function-map-table`
                                                                     for the definition of the keys included in that
                                                                     map.
     ".registers"                           map            Required  Hardware register configuration. See
                                                                     :ref:`amdgpu-amdpal-code-object-register-map-table`
                                                                     for the definition of the keys included in that
                                                                     map.
     ".user_data_limit"                     integer                  Number of user data entries accessed by this
                                                                     pipeline.
     ".spill_threshold"                     integer                  The user data spill threshold. 0xFFFF for
                                                                     NoUserDataSpilling.
     ".uses_viewport_array_index"           boolean                  Indicates whether or not the pipeline uses the
                                                                     viewport array index feature. Pipelines which use
                                                                     this feature can render into all 16 viewports,
                                                                     whereas pipelines which do not use it are
                                                                     restricted to viewport #0.
     ".es_gs_lds_size"                      integer                  Size in bytes of LDS space used internally for
                                                                     handling data-passing between the ES and GS
                                                                     shader stages. This can be zero if the data is
                                                                     passed using off-chip buffers. This value should
                                                                     be used to program all user-SGPRs which have been
                                                                     marked with "UserDataMapping::EsGsLdsSize"
                                                                     (typically only the GS and VS HW stages will ever
                                                                     have a user-SGPR so marked).
     ".nggSubgroupSize"                     integer                  Explicit maximum subgroup size for NGG shaders
                                                                     (maximum number of threads in a subgroup).
     ".num_interpolants"                    integer                  Graphics only. Number of PS interpolants.
     ".mesh_scratch_memory_size"            integer                  Max mesh shader scratch memory used.
     ".api"                                 string                   Name of the client graphics API.
     ".api_create_info"                     binary                   Graphics API shader create info binary blob. Can
                                                                     be defined by the driver using the compiler if
                                                                     they want to be able to correlate API-specific
                                                                     information used during creation at a later time.
     ====================================== ============== ========= ===================================================

..

  .. table:: AMDPAL Code Object Shader Map
     :name: amdgpu-amdpal-code-object-shader-map-table

     +-------------+--------------+-------------------------------------------------------------------+
     |String Key   |Value Type    |Description                                                        |
     +=============+==============+===================================================================+
     |- ".compute" |map           |See :ref:`amdgpu-amdpal-code-object-api-shader-metadata-map-table` |
     |- ".vertex"  |              |for the definition of the keys included in that map.               |
     |- ".hull"    |              |                                                                   |
     |- ".domain"  |              |                                                                   |
     |- ".geometry"|              |                                                                   |
     |- ".pixel"   |              |                                                                   |
     +-------------+--------------+-------------------------------------------------------------------+

..

  .. table:: AMDPAL Code Object API Shader Metadata Map
     :name: amdgpu-amdpal-code-object-api-shader-metadata-map-table

     ==================== ============== ========= =====================================================================
     String Key           Value Type     Required? Description
     ==================== ============== ========= =====================================================================
     ".api_shader_hash"   sequence of    Required  Input shader hash, typically passed in from the client. The value
                          2 integers               is implementation defined, and can not be relied on between
                                                   different builds of the compiler.
     ".hardware_mapping"  sequence of    Required  Flags indicating the HW stages this API shader maps to. Values
                          string                   include:

                                                   - ".ls"
                                                   - ".hs"
                                                   - ".es"
                                                   - ".gs"
                                                   - ".vs"
                                                   - ".ps"
                                                   - ".cs"

     ==================== ============== ========= =====================================================================

..

  .. table:: AMDPAL Code Object Hardware Stage Map
     :name: amdgpu-amdpal-code-object-hardware-stage-map-table

     +-------------+--------------+-----------------------------------------------------------------------+
     |String Key   |Value Type    |Description                                                            |
     +=============+==============+=======================================================================+
     |- ".ls"      |map           |See :ref:`amdgpu-amdpal-code-object-hardware-stage-metadata-map-table` |
     |- ".hs"      |              |for the definition of the keys included in that map.                   |
     |- ".es"      |              |                                                                       |
     |- ".gs"      |              |                                                                       |
     |- ".vs"      |              |                                                                       |
     |- ".ps"      |              |                                                                       |
     |- ".cs"      |              |                                                                       |
     +-------------+--------------+-----------------------------------------------------------------------+

..

  .. table:: AMDPAL Code Object Hardware Stage Metadata Map
     :name: amdgpu-amdpal-code-object-hardware-stage-metadata-map-table

     ========================== ============== ========= ===============================================================
     String Key                 Value Type     Required? Description
     ========================== ============== ========= ===============================================================
     ".entry_point"             string                   The ELF symbol pointing to this pipeline's stage entry point.
     ".scratch_memory_size"     integer                  Scratch memory size in bytes.
     ".lds_size"                integer                  Local Data Share size in bytes.
     ".perf_data_buffer_size"   integer                  Performance data buffer size in bytes.
     ".vgpr_count"              integer                  Number of VGPRs used.
     ".agpr_count"              integer                  Number of AGPRs used.
     ".sgpr_count"              integer                  Number of SGPRs used.
     ".vgpr_limit"              integer                  If non-zero, indicates the shader was compiled with a
                                                         directive to instruct the compiler to limit the VGPR usage to
                                                         be less than or equal to the specified value (only set if
                                                         different from HW default).
     ".sgpr_limit"              integer                  SGPR count upper limit (only set if different from HW
                                                         default).
     ".threadgroup_dimensions"  sequence of              Thread-group X/Y/Z dimensions (Compute only).
                                3 integers
     ".wavefront_size"          integer                  Wavefront size (only set if different from HW default).
     ".uses_uavs"               boolean                  The shader reads or writes UAVs.
     ".uses_rovs"               boolean                  The shader reads or writes ROVs.
     ".writes_uavs"             boolean                  The shader writes to one or more UAVs.
     ".writes_depth"            boolean                  The shader writes out a depth value.
     ".uses_append_consume"     boolean                  The shader uses append and/or consume operations, either
                                                         memory or GDS.
     ".uses_prim_id"            boolean                  The shader uses PrimID.
     ========================== ============== ========= ===============================================================

..

  .. table:: AMDPAL Code Object Shader Function Map
     :name: amdgpu-amdpal-code-object-shader-function-map-table

     =============== ============== ====================================================================
     String Key      Value Type     Description
     =============== ============== ====================================================================
     *symbol name*   map            *symbol name* is the ELF symbol name of the shader function code
                                    entry address. The value is the function's metadata. See
                                    :ref:`amdgpu-amdpal-code-object-shader-function-metadata-map-table`.
     =============== ============== ====================================================================

..

  .. table:: AMDPAL Code Object Shader Function Metadata Map
     :name: amdgpu-amdpal-code-object-shader-function-metadata-map-table

     ============================= ============== =================================================================
     String Key                    Value Type     Description
     ============================= ============== =================================================================
     ".api_shader_hash"            sequence of    Input shader hash, typically passed in from the client. The value
                                   2 integers     is implementation defined, and can not be relied on between
                                                  different builds of the compiler.
     ".scratch_memory_size"        integer        Size in bytes of scratch memory used by the shader.
     ".lds_size"                   integer        Size in bytes of LDS memory.
     ".vgpr_count"                 integer        Number of VGPRs used by the shader.
     ".sgpr_count"                 integer        Number of SGPRs used by the shader.
     ".stack_frame_size_in_bytes"  integer        Amount of stack size used by the shader.
     ".shader_subtype"             string         Shader subtype/kind. Values include:

                                                  - "Unknown"

     ============================= ============== =================================================================

..

  .. table:: AMDPAL Code Object Register Map
     :name: amdgpu-amdpal-code-object-register-map-table

     ========================== ============== ====================================================================
     32-bit Integer Key         Value Type     Description
     ========================== ============== ====================================================================
     ``reg offset``             32-bit integer ``reg offset`` is the dword offset into the GFXIP register space of
                                               a GRBM register (i.e., driver accessible GPU register number, not
                                               shader GPR register number). The driver is required to program each
                                               specified register to the corresponding specified value when
                                               executing this pipeline. Typically, the ``reg offsets`` are the
                                               ``uint16_t`` offsets to each register as defined by the hardware
                                               chip headers. The register is set to the provided value. However, a
                                               ``reg offset`` that specifies a user data register (e.g.,
                                               COMPUTE_USER_DATA_0) needs special treatment. See
                                               :ref:`amdgpu-amdpal-code-object-user-data-section` section for more
                                               information.
     ========================== ============== ====================================================================

.. _amdgpu-amdpal-code-object-user-data-section:

User Data
+++++++++

Each hardware stage has a set of 32-bit physical SPI *user data registers*
(either 16 or 32 based on graphics IP and the stage) which can be
written from a command buffer and then loaded into SGPRs when waves are
launched via a subsequent dispatch or draw operation. This is the way
most arguments are passed from the application/runtime to a hardware
shader.

PAL abstracts this functionality by exposing a set of 128 *user data
entries* per pipeline a client can use to pass arguments from a command
buffer to one or more shaders in that pipeline. The ELF code object must
specify a mapping from virtualized *user data entries* to physical *user
data registers*, and PAL is responsible for implementing that mapping,
including spilling overflow *user data entries* to memory if needed.

Since the *user data registers* are GRBM-accessible SPI registers, this
mapping is actually embedded in the ``.registers`` metadata entry. For
most registers, the value in that map is a literal 32-bit value that
should be written to the register by the driver. However, when the
register is a *user data register* (any USER_DATA register, e.g.,
SPI_SHADER_USER_DATA_PS_5), the value is instead an encoding that tells
the driver to write either a *user data entry* value or one of several
driver-internal values to the register. This encoding is described in
the following table:

.. note::

   Currently, *user data registers* 0 and 1 (e.g., SPI_SHADER_USER_DATA_PS_0,
   and SPI_SHADER_USER_DATA_PS_1) are reserved. *User data register* 0 must
   always be programmed to the address of the GlobalTable, and *user data
   register* 1 must always be programmed to the address of the PerShaderTable.

..

  .. table:: AMDPAL User Data Mapping
     :name: amdgpu-amdpal-code-object-metadata-user-data-mapping-table

     ========== ================= ===============================================================================
     Value      Name              Description
     ========== ================= ===============================================================================
     0..127     *User Data Entry* 32-bit value of user_data_entry[N] as specified via *CmdSetUserData()*
     0x10000000 GlobalTable       32-bit pointer to GPU memory containing the global internal table (should
                                  always point to *user data register* 0).
     0x10000001 PerShaderTable    32-bit pointer to GPU memory containing the per-shader internal table. See
                                  :ref:`amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section`
                                  for more detail (should always point to *user data register* 1).
     0x10000002 SpillTable        32-bit pointer to GPU memory containing the user data spill table. See
                                  :ref:`amdgpu-amdpal-code-object-metadata-user-data-spill-table-section` for
                                  more detail.
     0x10000003 BaseVertex        Vertex offset (32-bit unsigned integer). Not needed if the pipeline doesn't
                                  reference the draw index in the vertex shader. Only supported by the first
                                  stage in a graphics pipeline.
     0x10000004 BaseInstance      Instance offset (32-bit unsigned integer). Only supported by the first stage
                                  in a graphics pipeline.
     0x10000005 DrawIndex         Draw index (32-bit unsigned integer). Only supported by the first stage in a
                                  graphics pipeline.
     0x10000006 Workgroup         Thread group count (32-bit unsigned integer). Low half of a 64-bit address of
                                  a buffer containing the grid dimensions for a Compute dispatch operation. The
                                  high half of the address is stored in the next sequential user-SGPR. Only
                                  supported by compute pipelines.
     0x1000000A EsGsLdsSize       Indicates that PAL will program this user-SGPR to contain the amount of LDS
                                  space used for the ES/GS pseudo-ring-buffer for passing data between shader
                                  stages.
     0x1000000B ViewId            View id (32-bit unsigned integer) identifies a view of graphic
                                  pipeline instancing.
     0x1000000C StreamOutTable    32-bit pointer to GPU memory containing the stream out target SRD table. This
                                  can only appear for one shader stage per pipeline.
     0x1000000D PerShaderPerfData 32-bit pointer to GPU memory containing the per-shader performance data
                                  buffer.
     0x1000000F VertexBufferTable 32-bit pointer to GPU memory containing the vertex buffer SRD table. This can
                                  only appear for one shader stage per pipeline.
     0x10000010 UavExportTable    32-bit pointer to GPU memory containing the UAV export SRD table. This can
                                  only appear for one shader stage per pipeline (PS). These replace color
                                  targets and are completely separate from any UAVs used by the shader. This is
                                  optional, and only used by the PS when UAV exports are used to replace
                                  color-target exports to optimize specific shaders.
     0x10000011 NggCullingData    64-bit pointer to GPU memory containing the hardware register data needed by
                                  some NGG pipelines to perform culling. This value contains the address of the
                                  first of two consecutive registers which provide the full GPU address.
     0x10000015 FetchShaderPtr    64-bit pointer to GPU memory containing the fetch shader subroutine.
     ========== ================= ===============================================================================

.. _amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section:

Per-Shader Table
################

Low 32 bits of the GPU address for an optional buffer in the ``.data``
section of the ELF. The high 32 bits of the address match the high 32 bits
of the shader's program counter.

The buffer can be anything the shader compiler needs it for, and
allows each shader to have its own region of the ``.data`` section.
Typically, this could be a table of buffer SRDs and the data pointed to
by the buffer SRDs, but it could be a flat-address region of memory as
well. Its layout and usage are defined by the shader compiler.

Each shader's table in the ``.data`` section is referenced by the symbol
``_amdgpu_``\ *xs*\ ``_shdr_intrl_data`` where *xs* corresponds with the
hardware shader stage the data is for. E.g.,
``_amdgpu_cs_shdr_intrl_data`` for the compute shader hardware stage.

.. _amdgpu-amdpal-code-object-metadata-user-data-spill-table-section:

Spill Table
###########

It is possible for a hardware shader to need access to more *user data
entries* than there are slots available in user data registers for one
or more hardware shader stages. In that case, the PAL runtime expects
the necessary *user data entries* to be spilled to GPU memory, using
one user data register to point to the spilled user data memory. The
value of the *user data entry* must then represent the location where
a shader expects to read the low 32-bits of the table's GPU virtual
address.
The *spill table* itself represents a set of 32-bit values
managed by the PAL runtime in GPU-accessible memory that can be made
indirectly accessible to a hardware shader.

Unspecified OS
--------------

This section provides code conventions used when the target triple OS is
empty (see :ref:`amdgpu-target-triples`).

Trap Handler ABI
~~~~~~~~~~~~~~~~

For code objects generated by the AMDGPU backend for a non-amdhsa OS, the
runtime does not install a trap handler. The ``llvm.trap`` and
``llvm.debugtrap`` instructions are handled as follows:

  .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
     :name: amdgpu-trap-handler-for-non-amdhsa-os-table

     =============== =============== ===========================================
     Usage           Code Sequence   Description
     =============== =============== ===========================================
     llvm.trap       s_endpgm        Causes wavefront to be terminated.
     llvm.debugtrap  *none*          Compiler warning given that there is no
                                     trap handler installed.
     =============== =============== ===========================================

Source Languages
================

.. _amdgpu-opencl:

OpenCL
------

When the language is OpenCL, the following differences occur:

1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
2. The AMDGPU backend appends additional arguments to the kernel's explicit
   arguments for the AMDHSA OS (see
   :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
3. Additional metadata is generated
   (see :ref:`amdgpu-amdhsa-code-object-metadata`).

  .. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
     :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table

     ======== ==== ========= ===========================================
     Position Byte Byte      Description
              Size Alignment
     ======== ==== ========= ===========================================
     1        8    8         OpenCL Global Offset X
     2        8    8         OpenCL Global Offset Y
     3        8    8         OpenCL Global Offset Z
     4        8    8         OpenCL address of printf buffer
     5        8    8         OpenCL address of virtual queue used by
                             enqueue_kernel.
     6        8    8         OpenCL address of AqlWrap struct used by
                             enqueue_kernel.
     7        8    8         Pointer argument used for Multi-grid
                             synchronization.
     ======== ==== ========= ===========================================

.. _amdgpu-hcc:

HCC
---

When the language is HCC, the following differences occur:

1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).

.. _amdgpu-assembler:

Assembler
---------

The AMDGPU backend has an LLVM-MC based assembler which is currently in
development. It supports AMDGCN GFX6-GFX11.

This section describes the general syntax for instructions and operands.

Instructions
~~~~~~~~~~~~

An instruction has the following :doc:`syntax<AMDGPUInstructionSyntax>`:

  | ``<``\ *opcode*\ ``> <``\ *operand0*\ ``>, <``\ *operand1*\ ``>,...
    <``\ *modifier0*\ ``> <``\ *modifier1*\ ``>...``

:doc:`Operands<AMDGPUOperandSyntax>` are comma-separated, while
:doc:`modifiers<AMDGPUModifierSyntax>` are space-separated.

The order of operands and modifiers is fixed.
Most modifiers are optional and may be omitted.

Links to detailed instruction syntax descriptions may be found in the
following table. Note that features under development are not included
in this description.

  ============= ============================================= =======================================
  Architecture  Core ISA                                      ISA Variants and Extensions
  ============= ============================================= =======================================
  GCN 2         :doc:`GFX7<AMDGPU/AMDGPUAsmGFX7>`             \-
  GCN 3, GCN 4  :doc:`GFX8<AMDGPU/AMDGPUAsmGFX8>`             \-
  GCN 5         :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>`             :doc:`gfx900<AMDGPU/AMDGPUAsmGFX900>`

                                                              :doc:`gfx902<AMDGPU/AMDGPUAsmGFX900>`

                                                              :doc:`gfx904<AMDGPU/AMDGPUAsmGFX904>`

                                                              :doc:`gfx906<AMDGPU/AMDGPUAsmGFX906>`

                                                              :doc:`gfx909<AMDGPU/AMDGPUAsmGFX900>`

                                                              :doc:`gfx90c<AMDGPU/AMDGPUAsmGFX900>`

  CDNA 1        :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>`             :doc:`gfx908<AMDGPU/AMDGPUAsmGFX908>`

  CDNA 2        :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>`             :doc:`gfx90a<AMDGPU/AMDGPUAsmGFX90a>`

  CDNA 3        :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>`             :doc:`gfx940<AMDGPU/AMDGPUAsmGFX940>`

  RDNA 1        :doc:`GFX10 RDNA1<AMDGPU/AMDGPUAsmGFX10>`     :doc:`gfx1010<AMDGPU/AMDGPUAsmGFX10>`

                                                              :doc:`gfx1011<AMDGPU/AMDGPUAsmGFX1011>`

                                                              :doc:`gfx1012<AMDGPU/AMDGPUAsmGFX1011>`

                                                              :doc:`gfx1013<AMDGPU/AMDGPUAsmGFX1013>`

  RDNA 2        :doc:`GFX10 RDNA2<AMDGPU/AMDGPUAsmGFX1030>`   :doc:`gfx1030<AMDGPU/AMDGPUAsmGFX1030>`

                                                              :doc:`gfx1031<AMDGPU/AMDGPUAsmGFX1030>`

                                                              :doc:`gfx1032<AMDGPU/AMDGPUAsmGFX1030>`

                                                              :doc:`gfx1033<AMDGPU/AMDGPUAsmGFX1030>`

                                                              :doc:`gfx1034<AMDGPU/AMDGPUAsmGFX1030>`

                                                              :doc:`gfx1035<AMDGPU/AMDGPUAsmGFX1030>`

                                                              :doc:`gfx1036<AMDGPU/AMDGPUAsmGFX1030>`
  ============= ============================================= =======================================

For more information about instructions, their semantics and supported
combinations of operands, refer to one of the instruction set architecture
manuals [AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_,
[AMD-GCN-GFX900-GFX904-VEGA]_, [AMD-GCN-GFX906-VEGA7NM]_,
[AMD-GCN-GFX908-CDNA1]_, [AMD-GCN-GFX90A-CDNA2]_, [AMD-GCN-GFX10-RDNA1]_ and
[AMD-GCN-GFX10-RDNA2]_.

Operands
~~~~~~~~

A detailed description of operands may be found
:doc:`here<AMDGPUOperandSyntax>`.

Modifiers
~~~~~~~~~

A detailed description of modifiers may be found
:doc:`here<AMDGPUModifierSyntax>`.

Instruction Examples
~~~~~~~~~~~~~~~~~~~~

DS
++

.. code-block:: nasm

  ds_add_u32 v2, v4 offset:16
  ds_write_src2_b64 v2 offset0:4 offset1:8
  ds_cmpst_f32 v2, v4, v6
  ds_min_rtn_f64 v[8:9], v2, v[4:5]

For a full list of supported instructions, refer to "LDS/GDS instructions" in
the ISA Manual.

FLAT
++++

.. code-block:: nasm

  flat_load_dword v1, v[3:4]
  flat_store_dwordx3 v[3:4], v[5:7]
  flat_atomic_swap v1, v[3:4], v5 glc
  flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
  flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc

For a full list of supported instructions, refer to "FLAT instructions" in
the ISA Manual.

MUBUF
+++++

.. code-block:: nasm

  buffer_load_dword v1, off, s[4:7], s1
  buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
  buffer_store_format_xy v[1:2], off, s[4:7], s1
  buffer_wbinvl1
  buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc

For a full list of supported instructions, refer to "MUBUF Instructions" in
the ISA Manual.

SMRD/SMEM
+++++++++

.. code-block:: nasm

  s_load_dword s1, s[2:3], 0xfc
  s_load_dwordx8 s[8:15], s[2:3], s4
  s_load_dwordx16 s[88:103], s[2:3], s4
  s_dcache_inv_vol
  s_memtime s[4:5]

For a full list of supported instructions, refer to "Scalar Memory Operations"
in the ISA Manual.
SOP1
++++

.. code-block:: nasm

  s_mov_b32 s1, s2
  s_mov_b64 s[0:1], 0x80000000
  s_cmov_b32 s1, 200
  s_wqm_b64 s[2:3], s[4:5]
  s_bcnt0_i32_b64 s1, s[2:3]
  s_swappc_b64 s[2:3], s[4:5]
  s_cbranch_join s[4:5]

For a full list of supported instructions, refer to "SOP1 Instructions" in
the ISA manual.

SOP2
++++

.. code-block:: nasm

  s_add_u32 s1, s2, s3
  s_and_b64 s[2:3], s[4:5], s[6:7]
  s_cselect_b32 s1, s2, s3
  s_andn2_b32 s2, s4, s6
  s_lshr_b64 s[2:3], s[4:5], s6
  s_ashr_i32 s2, s4, s6
  s_bfm_b64 s[2:3], s4, s6
  s_bfe_i64 s[2:3], s[4:5], s6
  s_cbranch_g_fork s[4:5], s[6:7]

For a full list of supported instructions, refer to "SOP2 Instructions" in
the ISA manual.

SOPC
++++

.. code-block:: nasm

  s_cmp_eq_i32 s1, s2
  s_bitcmp1_b32 s1, s2
  s_bitcmp0_b64 s[2:3], s4
  s_setvskip s3, s5

For a full list of supported instructions, refer to "SOPC Instructions" in
the ISA manual.

SOPP
++++

.. code-block:: nasm

  s_barrier
  s_nop 2
  s_endpgm
  s_waitcnt 0 ; Wait for all counters to be 0
  s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
  s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
  s_sethalt 9
  s_sleep 10
  s_sendmsg 0x1
  s_sendmsg sendmsg(MSG_INTERRUPT)
  s_trap 1

For a full list of supported instructions, refer to "SOPP Instructions" in
the ISA manual.

Unless otherwise mentioned, little verification is performed on the operands
of SOPP instructions, so it is up to the programmer to be familiar with the
range of acceptable values.
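For illustration, the counters in the ``s_waitcnt`` examples above are packed
into the instruction's single simm16 operand, which is why ``s_waitcnt 0``
and ``s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0)`` are equivalent. The
following is a minimal sketch assuming the GFX9 field layout (``vmcnt`` split
across bits [3:0] and [15:14], ``expcnt`` in bits [6:4], ``lgkmcnt`` in bits
[11:8]); field widths and positions differ on other generations, so consult
the ISA manual for the target being assembled.

```python
def encode_waitcnt_gfx9(vmcnt=63, expcnt=7, lgkmcnt=15):
    """Pack counter values into an s_waitcnt simm16 (GFX9 layout assumed).

    Defaults are the per-field maxima, i.e. "do not wait on this counter".
    """
    assert 0 <= vmcnt < 64 and 0 <= expcnt < 8 and 0 <= lgkmcnt < 16
    return ((vmcnt & 0xF)          # vmcnt low bits  -> simm16[3:0]
            | ((vmcnt >> 4) << 14) # vmcnt high bits -> simm16[15:14]
            | (expcnt << 4)        # expcnt          -> simm16[6:4]
            | (lgkmcnt << 8))      # lgkmcnt         -> simm16[11:8]

# s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) -> simm16 of 0
print(hex(encode_waitcnt_gfx9(0, 0, 0)))     # 0x0
# s_waitcnt lgkmcnt(0), other counters unconstrained
print(hex(encode_waitcnt_gfx9(lgkmcnt=0)))   # 0xc07f
```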
VALU
++++

For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP,
VOP_SDWA), the assembler will automatically use the optimal encoding based on
the operands. To force a specific encoding, add one of the following suffixes
to the opcode of the instruction:

* ``_e32`` for 32-bit VOP1/VOP2/VOPC
* ``_e64`` for 64-bit VOP3
* ``_dpp`` for VOP_DPP
* ``_sdwa`` for VOP_SDWA

VOP1/VOP2/VOP3/VOPC examples:

.. code-block:: nasm

  v_mov_b32 v1, v2
  v_mov_b32_e32 v1, v2
  v_nop
  v_cvt_f64_i32_e32 v[1:2], v2
  v_floor_f32_e32 v1, v2
  v_bfrev_b32_e32 v1, v2
  v_add_f32_e32 v1, v2, v3
  v_mul_i32_i24_e64 v1, v2, 3
  v_mul_i32_i24_e32 v1, -3, v3
  v_mul_i32_i24_e32 v1, -100, v3
  v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
  v_max_f16_e32 v1, v2, v3

VOP_DPP examples:

.. code-block:: nasm

  v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
  v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_mov_b32 v0, v0 wave_shl:1
  v_mov_b32 v0, v0 row_mirror
  v_mov_b32 v0, v0 row_bcast:31
  v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0

VOP_SDWA examples:

.. code-block:: nasm

  v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
  v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
  v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
  v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
  v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0

For a full list of supported instructions, refer to "Vector ALU instructions"
in the ISA manual.

..
_amdgpu-amdhsa-assembler-predefined-symbols-v2: 14532 14533Code Object V2 Predefined Symbols 14534~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 14535 14536.. warning:: 14537 Code object V2 is not the default code object version emitted by 14538 this version of LLVM. 14539 14540The AMDGPU assembler defines and updates some symbols automatically. These 14541symbols do not affect code generation. 14542 14543.option.machine_version_major 14544+++++++++++++++++++++++++++++ 14545 14546Set to the GFX major generation number of the target being assembled for. For 14547example, when assembling for a "GFX9" target this will be set to the integer 14548value "9". The possible GFX major generation numbers are presented in 14549:ref:`amdgpu-processors`. 14550 14551.option.machine_version_minor 14552+++++++++++++++++++++++++++++ 14553 14554Set to the GFX minor generation number of the target being assembled for. For 14555example, when assembling for a "GFX810" target this will be set to the integer 14556value "1". The possible GFX minor generation numbers are presented in 14557:ref:`amdgpu-processors`. 14558 14559.option.machine_version_stepping 14560++++++++++++++++++++++++++++++++ 14561 14562Set to the GFX stepping generation number of the target being assembled for. 14563For example, when assembling for a "GFX704" target this will be set to the 14564integer value "4". The possible GFX stepping generation numbers are presented 14565in :ref:`amdgpu-processors`. 14566 14567.kernel.vgpr_count 14568++++++++++++++++++ 14569 14570Set to zero each time a 14571:ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is 14572encountered. At each instruction, if the current value of this symbol is less 14573than or equal to the maximum VGPR number explicitly referenced within that 14574instruction then the symbol value is updated to equal that VGPR number plus 14575one. 
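The update rule above amounts to keeping a running "maximum referenced VGPR
plus one". As a minimal sketch, the following recomputes it over a list of
instruction strings; the VGPR references are matched with a simplified,
hypothetical pattern (``v7`` or ``v[8:9]``), whereas the real assembler
inspects the parsed operands of each instruction.

```python
import re

def vgpr_count(instructions):
    """Recompute the .kernel.vgpr_count update rule described above.

    Starts at zero; at each instruction, if the current value is less than
    or equal to the maximum explicitly referenced VGPR number, it becomes
    that VGPR number plus one.
    """
    count = 0
    for inst in instructions:
        for m in re.finditer(r"\bv(?:(\d+)|\[(\d+):(\d+)\])", inst):
            max_vgpr = max(int(g) for g in m.groups() if g is not None)
            if count <= max_vgpr:
                count = max_vgpr + 1
    return count

print(vgpr_count(["v_mov_b32 v1, s0",
                  "flat_store_dword v[1:2], v0"]))  # 3
```

The same rule, with SGPR operands, gives ``.kernel.sgpr_count``.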
.kernel.sgpr_count
++++++++++++++++++

Set to zero each time a
:ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
encountered. At each instruction, if the current value of this symbol is less
than or equal to the maximum SGPR number explicitly referenced within that
instruction then the symbol value is updated to equal that SGPR number plus
one.

.. _amdgpu-amdhsa-assembler-directives-v2:

Code Object V2 Directives
~~~~~~~~~~~~~~~~~~~~~~~~~

.. warning::
  Code object V2 is not the default code object version emitted by
  this version of LLVM.

The AMDGPU ABI defines auxiliary data in the output code object. In assembly
source, it can be specified with assembler directives.

.hsa_code_object_version major, minor
+++++++++++++++++++++++++++++++++++++

*major* and *minor* are integers that specify the version of the HSA code
object that will be generated by the assembler.

.hsa_code_object_isa [major, minor, stepping, vendor, arch]
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

*major*, *minor*, and *stepping* are all integers that describe the
instruction set architecture (ISA) version of the assembly program.

*vendor* and *arch* are quoted strings. *vendor* should always be equal to
"AMD" and *arch* should always be equal to "AMDGPU".

By default, the assembler will derive the ISA version, *vendor*, and *arch*
from the value of the -mcpu option that is passed to the assembler.

.. _amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel:

.amdgpu_hsa_kernel (name)
+++++++++++++++++++++++++

This directive specifies that the symbol with the given name is a kernel
entry point (label) and that the object should contain a corresponding symbol
of type STT_AMDGPU_HSA_KERNEL.
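As noted for *.hsa_code_object_isa* above, the assembler derives the ISA
version from the -mcpu value. The relationship between the processor name and
the (major, minor, stepping) triple can be sketched as follows, assuming the
common ``gfx<major><minor><stepping>`` naming described in
:ref:`amdgpu-processors`; ``isa_version`` is a hypothetical helper, not an
LLVM API.

```python
def isa_version(mcpu):
    """Split a processor name such as 'gfx810' into (major, minor, stepping).

    The trailing digit is the stepping, the digit before it the minor
    version, and the remaining leading digits the major version. Steppings
    may also be hex digits (e.g. 'gfx90a'), handled here for a-f.
    """
    assert mcpu.startswith("gfx")
    digits = mcpu[len("gfx"):]
    major = int(digits[:-2])
    minor = int(digits[-2], 16)
    stepping = int(digits[-1], 16)
    return major, minor, stepping

print(isa_version("gfx810"))   # (8, 1, 0)
print(isa_version("gfx704"))   # (7, 0, 4)
print(isa_version("gfx1030"))  # (10, 3, 0)
```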
.amd_kernel_code_t
++++++++++++++++++

This directive marks the beginning of a list of key / value pairs that are
used to specify the amd_kernel_code_t object that will be emitted by the
assembler. The list must be terminated by the *.end_amd_kernel_code_t*
directive. For any amd_kernel_code_t values that are unspecified a default
value will be used. The default value for all keys is 0, with the following
exceptions:

- *amd_code_version_major* defaults to 1.
- *amd_code_version_minor* defaults to 2.
- *amd_machine_kind* defaults to 1.
- *amd_machine_version_major*, *amd_machine_version_minor*, and
  *amd_machine_version_stepping* are derived from the value of the -mcpu
  option that is passed to the assembler.
- *kernel_code_entry_byte_offset* defaults to 256.
- *wavefront_size* defaults to 6 for all targets before GFX10. For GFX10
  onwards defaults to 6 if target feature ``wavefrontsize64`` is enabled,
  otherwise 5. Note that the wavefront size is specified as a power of two,
  so a value of **n** means a size of 2^ **n**.
- *call_convention* defaults to -1.
- *kernarg_segment_alignment*, *group_segment_alignment*, and
  *private_segment_alignment* default to 4. Note that alignments are
  specified as a power of 2, so a value of **n** means an alignment of 2^
  **n**.
- *enable_tg_split* defaults to 1 if target feature ``tgsplit`` is enabled
  for GFX90A onwards.
- *enable_wgp_mode* defaults to 1 if target feature ``cumode`` is disabled
  for GFX10 onwards.
- *enable_mem_ordered* defaults to 1 for GFX10 onwards.

The *.amd_kernel_code_t* directive must be placed immediately after the
function label and before any instructions.

For a full list of amd_kernel_code_t keys, refer to the AMDGPU ABI document,
comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.

..
_amdgpu-amdhsa-assembler-example-v2: 14664 14665Code Object V2 Example Source Code 14666~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 14667 14668.. warning:: 14669 Code Object V2 is not the default code object version emitted by 14670 this version of LLVM. 14671 14672Here is an example of a minimal assembly source file, defining one HSA kernel: 14673 14674.. code:: 14675 :number-lines: 14676 14677 .hsa_code_object_version 1,0 14678 .hsa_code_object_isa 14679 14680 .hsatext 14681 .globl hello_world 14682 .p2align 8 14683 .amdgpu_hsa_kernel hello_world 14684 14685 hello_world: 14686 14687 .amd_kernel_code_t 14688 enable_sgpr_kernarg_segment_ptr = 1 14689 is_ptr64 = 1 14690 compute_pgm_rsrc1_vgprs = 0 14691 compute_pgm_rsrc1_sgprs = 0 14692 compute_pgm_rsrc2_user_sgpr = 2 14693 compute_pgm_rsrc1_wgp_mode = 0 14694 compute_pgm_rsrc1_mem_ordered = 0 14695 compute_pgm_rsrc1_fwd_progress = 1 14696 .end_amd_kernel_code_t 14697 14698 s_load_dwordx2 s[0:1], s[0:1] 0x0 14699 v_mov_b32 v0, 3.14159 14700 s_waitcnt lgkmcnt(0) 14701 v_mov_b32 v1, s0 14702 v_mov_b32 v2, s1 14703 flat_store_dword v[1:2], v0 14704 s_endpgm 14705 .Lfunc_end0: 14706 .size hello_world, .Lfunc_end0-hello_world 14707 14708.. _amdgpu-amdhsa-assembler-predefined-symbols-v3-onwards: 14709 14710Code Object V3 and Above Predefined Symbols 14711~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 14712 14713The AMDGPU assembler defines and updates some symbols automatically. These 14714symbols do not affect code generation. 14715 14716.amdgcn.gfx_generation_number 14717+++++++++++++++++++++++++++++ 14718 14719Set to the GFX major generation number of the target being assembled for. For 14720example, when assembling for a "GFX9" target this will be set to the integer 14721value "9". The possible GFX major generation numbers are presented in 14722:ref:`amdgpu-processors`. 14723 14724.amdgcn.gfx_generation_minor 14725++++++++++++++++++++++++++++ 14726 14727Set to the GFX minor generation number of the target being assembled for. 
For
example, when assembling for a "GFX810" target this will be set to the integer
value "1". The possible GFX minor generation numbers are presented in
:ref:`amdgpu-processors`.

.amdgcn.gfx_generation_stepping
+++++++++++++++++++++++++++++++

Set to the GFX stepping generation number of the target being assembled for.
For example, when assembling for a "GFX704" target this will be set to the
integer value "4". The possible GFX stepping generation numbers are presented
in :ref:`amdgpu-processors`.

.. _amdgpu-amdhsa-assembler-symbol-next_free_vgpr:

.amdgcn.next_free_vgpr
++++++++++++++++++++++

Set to zero before assembly begins. At each instruction, if the current value
of this symbol is less than or equal to the maximum VGPR number explicitly
referenced within that instruction then the symbol value is updated to equal
that VGPR number plus one.

May be used to set the `.amdhsa_next_free_vgpr` directive in
:ref:`amdhsa-kernel-directives-table`.

May be set at any time, e.g. manually set to zero at the start of each kernel.

.. _amdgpu-amdhsa-assembler-symbol-next_free_sgpr:

.amdgcn.next_free_sgpr
++++++++++++++++++++++

Set to zero before assembly begins. At each instruction, if the current value
of this symbol is less than or equal to the maximum SGPR number explicitly
referenced within that instruction then the symbol value is updated to equal
that SGPR number plus one.

May be used to set the `.amdhsa_next_free_sgpr` directive in
:ref:`amdhsa-kernel-directives-table`.

May be set at any time, e.g. manually set to zero at the start of each kernel.

..
_amdgpu-amdhsa-assembler-directives-v3-onwards: 14771 14772Code Object V3 and Above Directives 14773~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 14774 14775Directives which begin with ``.amdgcn`` are valid for all ``amdgcn`` 14776architecture processors, and are not OS-specific. Directives which begin with 14777``.amdhsa`` are specific to ``amdgcn`` architecture processors when the 14778``amdhsa`` OS is specified. See :ref:`amdgpu-target-triples` and 14779:ref:`amdgpu-processors`. 14780 14781.. _amdgpu-assembler-directive-amdgcn-target: 14782 14783.amdgcn_target <target-triple> "-" <target-id> 14784++++++++++++++++++++++++++++++++++++++++++++++ 14785 14786Optional directive which declares the ``<target-triple>-<target-id>`` supported 14787by the containing assembler source file. Used by the assembler to validate 14788command-line options such as ``-triple``, ``-mcpu``, and 14789``--offload-arch=<target-id>``. A non-canonical target ID is allowed. See 14790:ref:`amdgpu-target-triples` and :ref:`amdgpu-target-id`. 14791 14792.. note:: 14793 14794 The target ID syntax used for code object V2 to V3 for this directive differs 14795 from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`. 14796 14797.amdhsa_kernel <name> 14798+++++++++++++++++++++ 14799 14800Creates a correctly aligned AMDHSA kernel descriptor and a symbol, 14801``<name>.kd``, in the current location of the current section. Only valid when 14802the OS is ``amdhsa``. ``<name>`` must be a symbol that labels the first 14803instruction to execute, and does not need to be previously defined. 14804 14805Marks the beginning of a list of directives used to generate the bytes of a 14806kernel descriptor, as described in :ref:`amdgpu-amdhsa-kernel-descriptor`. 14807Directives which may appear in this list are described in 14808:ref:`amdhsa-kernel-directives-table`. Directives may appear in any order, must 14809be valid for the target being assembled for, and cannot be repeated. 
Directives 14810support the range of values specified by the field they reference in 14811:ref:`amdgpu-amdhsa-kernel-descriptor`. If a directive is not specified, it is 14812assumed to have its default value, unless it is marked as "Required", in which 14813case it is an error to omit the directive. This list of directives is 14814terminated by an ``.end_amdhsa_kernel`` directive. 14815 14816 .. table:: AMDHSA Kernel Assembler Directives 14817 :name: amdhsa-kernel-directives-table 14818 14819 ======================================================== =================== ============ =================== 14820 Directive Default Supported On Description 14821 ======================================================== =================== ============ =================== 14822 ``.amdhsa_group_segment_fixed_size`` 0 GFX6-GFX11 Controls GROUP_SEGMENT_FIXED_SIZE in 14823 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 14824 ``.amdhsa_private_segment_fixed_size`` 0 GFX6-GFX11 Controls PRIVATE_SEGMENT_FIXED_SIZE in 14825 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 14826 ``.amdhsa_kernarg_size`` 0 GFX6-GFX11 Controls KERNARG_SIZE in 14827 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 14828 ``.amdhsa_user_sgpr_count`` 0 GFX6-GFX11 Controls USER_SGPR_COUNT in COMPUTE_PGM_RSRC2 14829 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table` 14830 ``.amdhsa_user_sgpr_private_segment_buffer`` 0 GFX6-GFX10 Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in 14831 (except :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 14832 GFX940) 14833 ``.amdhsa_user_sgpr_dispatch_ptr`` 0 GFX6-GFX11 Controls ENABLE_SGPR_DISPATCH_PTR in 14834 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 14835 ``.amdhsa_user_sgpr_queue_ptr`` 0 GFX6-GFX11 Controls ENABLE_SGPR_QUEUE_PTR in 14836 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 14837 ``.amdhsa_user_sgpr_kernarg_segment_ptr`` 0 GFX6-GFX11 Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in 14838 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 
14839 ``.amdhsa_user_sgpr_dispatch_id`` 0 GFX6-GFX11 Controls ENABLE_SGPR_DISPATCH_ID in 14840 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 14841 ``.amdhsa_user_sgpr_flat_scratch_init`` 0 GFX6-GFX10 Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in 14842 (except :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 14843 GFX940) 14844 ``.amdhsa_user_sgpr_private_segment_size`` 0 GFX6-GFX11 Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in 14845 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 14846 ``.amdhsa_wavefront_size32`` Target GFX10-GFX11 Controls ENABLE_WAVEFRONT_SIZE32 in 14847 Feature :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 14848 Specific 14849 (wavefrontsize64) 14850 ``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0 GFX6-GFX10 Controls ENABLE_PRIVATE_SEGMENT in 14851 (except :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`. 14852 GFX940) 14853 ``.amdhsa_enable_private_segment`` 0 GFX940, Controls ENABLE_PRIVATE_SEGMENT in 14854 GFX11 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`. 14855 ``.amdhsa_system_sgpr_workgroup_id_x`` 1 GFX6-GFX11 Controls ENABLE_SGPR_WORKGROUP_ID_X in 14856 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`. 14857 ``.amdhsa_system_sgpr_workgroup_id_y`` 0 GFX6-GFX11 Controls ENABLE_SGPR_WORKGROUP_ID_Y in 14858 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`. 14859 ``.amdhsa_system_sgpr_workgroup_id_z`` 0 GFX6-GFX11 Controls ENABLE_SGPR_WORKGROUP_ID_Z in 14860 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`. 14861 ``.amdhsa_system_sgpr_workgroup_info`` 0 GFX6-GFX11 Controls ENABLE_SGPR_WORKGROUP_INFO in 14862 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`. 14863 ``.amdhsa_system_vgpr_workitem_id`` 0 GFX6-GFX11 Controls ENABLE_VGPR_WORKITEM_ID in 14864 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`. 14865 Possible values are defined in 14866 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`. 
14867 ``.amdhsa_next_free_vgpr`` Required GFX6-GFX11 Maximum VGPR number explicitly referenced, plus one. 14868 Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in 14869 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`. 14870 ``.amdhsa_next_free_sgpr`` Required GFX6-GFX11 Maximum SGPR number explicitly referenced, plus one. 14871 Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in 14872 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`. 14873 ``.amdhsa_accum_offset`` Required GFX90A, Offset of a first AccVGPR in the unified register file. 14874 GFX940 Used to calculate ACCUM_OFFSET in 14875 :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`. 14876 ``.amdhsa_reserve_vcc`` 1 GFX6-GFX11 Whether the kernel may use the special VCC SGPR. 14877 Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in 14878 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`. 14879 ``.amdhsa_reserve_flat_scratch`` 1 GFX7-GFX10 Whether the kernel may use flat instructions to access 14880 (except scratch memory. Used to calculate 14881 GFX940) GRANULATED_WAVEFRONT_SGPR_COUNT in 14882 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`. 14883 ``.amdhsa_reserve_xnack_mask`` Target GFX8-GFX10 Whether the kernel may trigger XNACK replay. 14884 Feature Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in 14885 Specific :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`. 14886 (xnack) 14887 ``.amdhsa_float_round_mode_32`` 0 GFX6-GFX11 Controls FLOAT_ROUND_MODE_32 in 14888 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`. 14889 Possible values are defined in 14890 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`. 14891 ``.amdhsa_float_round_mode_16_64`` 0 GFX6-GFX11 Controls FLOAT_ROUND_MODE_16_64 in 14892 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`. 14893 Possible values are defined in 14894 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`. 
14895 ``.amdhsa_float_denorm_mode_32`` 0 GFX6-GFX11 Controls FLOAT_DENORM_MODE_32 in 14896 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`. 14897 Possible values are defined in 14898 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`. 14899 ``.amdhsa_float_denorm_mode_16_64`` 3 GFX6-GFX11 Controls FLOAT_DENORM_MODE_16_64 in 14900 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`. 14901 Possible values are defined in 14902 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`. 14903 ``.amdhsa_dx10_clamp`` 1 GFX6-GFX11 Controls ENABLE_DX10_CLAMP in 14904 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`. 14905 ``.amdhsa_ieee_mode`` 1 GFX6-GFX11 Controls ENABLE_IEEE_MODE in 14906 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`. 14907 ``.amdhsa_fp16_overflow`` 0 GFX9-GFX11 Controls FP16_OVFL in 14908 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`. 14909 ``.amdhsa_tg_split`` Target GFX90A, Controls TG_SPLIT in 14910 Feature GFX940, :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`. 14911 Specific GFX11 14912 (tgsplit) 14913 ``.amdhsa_workgroup_processor_mode`` Target GFX10-GFX11 Controls ENABLE_WGP_MODE in 14914 Feature :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 14915 Specific 14916 (cumode) 14917 ``.amdhsa_memory_ordered`` 1 GFX10-GFX11 Controls MEM_ORDERED in 14918 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`. 14919 ``.amdhsa_forward_progress`` 0 GFX10-GFX11 Controls FWD_PROGRESS in 14920 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`. 14921 ``.amdhsa_shared_vgpr_count`` 0 GFX10-GFX11 Controls SHARED_VGPR_COUNT in 14922 :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-gfx11-table`. 14923 ``.amdhsa_exception_fp_ieee_invalid_op`` 0 GFX6-GFX11 Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in 14924 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`. 
14925 ``.amdhsa_exception_fp_denorm_src`` 0 GFX6-GFX11 Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in 14926 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`. 14927 ``.amdhsa_exception_fp_ieee_div_zero`` 0 GFX6-GFX11 Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in 14928 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`. 14929 ``.amdhsa_exception_fp_ieee_overflow`` 0 GFX6-GFX11 Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in 14930 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`. 14931 ``.amdhsa_exception_fp_ieee_underflow`` 0 GFX6-GFX11 Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in 14932 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`. 14933 ``.amdhsa_exception_fp_ieee_inexact`` 0 GFX6-GFX11 Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in 14934 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`. 14935 ``.amdhsa_exception_int_div_zero`` 0 GFX6-GFX11 Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in 14936 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`. 14937 ======================================================== =================== ============ =================== 14938 14939.amdgpu_metadata 14940++++++++++++++++ 14941 14942Optional directive which declares the contents of the ``NT_AMDGPU_METADATA`` 14943note record (see :ref:`amdgpu-elf-note-records-table-v3-onwards`). 14944 14945The contents must be in the [YAML]_ markup format, with the same structure and 14946semantics described in :ref:`amdgpu-amdhsa-code-object-metadata-v3`, 14947:ref:`amdgpu-amdhsa-code-object-metadata-v4` or 14948:ref:`amdgpu-amdhsa-code-object-metadata-v5`. 14949 14950This directive is terminated by an ``.end_amdgpu_metadata`` directive. 14951 14952.. _amdgpu-amdhsa-assembler-example-v3-onwards: 14953 14954Code Object V3 and Above Example Source Code 14955~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 14956 14957Here is an example of a minimal assembly source file, defining one HSA kernel: 14958 14959.. 
code:: 14960 :number-lines: 14961 14962 .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional 14963 14964 .text 14965 .globl hello_world 14966 .p2align 8 14967 .type hello_world,@function 14968 hello_world: 14969 s_load_dwordx2 s[0:1], s[0:1] 0x0 14970 v_mov_b32 v0, 3.14159 14971 s_waitcnt lgkmcnt(0) 14972 v_mov_b32 v1, s0 14973 v_mov_b32 v2, s1 14974 flat_store_dword v[1:2], v0 14975 s_endpgm 14976 .Lfunc_end0: 14977 .size hello_world, .Lfunc_end0-hello_world 14978 14979 .rodata 14980 .p2align 6 14981 .amdhsa_kernel hello_world 14982 .amdhsa_user_sgpr_kernarg_segment_ptr 1 14983 .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr 14984 .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr 14985 .end_amdhsa_kernel 14986 14987 .amdgpu_metadata 14988 --- 14989 amdhsa.version: 14990 - 1 14991 - 0 14992 amdhsa.kernels: 14993 - .name: hello_world 14994 .symbol: hello_world.kd 14995 .kernarg_segment_size: 48 14996 .group_segment_fixed_size: 0 14997 .private_segment_fixed_size: 0 14998 .kernarg_segment_align: 4 14999 .wavefront_size: 64 15000 .sgpr_count: 2 15001 .vgpr_count: 3 15002 .max_flat_workgroup_size: 256 15003 .args: 15004 - .size: 8 15005 .offset: 0 15006 .value_kind: global_buffer 15007 .address_space: global 15008 .actual_access: write_only 15009 //... 15010 .end_amdgpu_metadata 15011 15012This kernel is equivalent to the following HIP program: 15013 15014.. code:: 15015 :number-lines: 15016 15017 __global__ void hello_world(float *p) { 15018 *p = 3.14159f; 15019 } 15020 15021If an assembly source file contains multiple kernels and/or functions, the 15022:ref:`amdgpu-amdhsa-assembler-symbol-next_free_vgpr` and 15023:ref:`amdgpu-amdhsa-assembler-symbol-next_free_sgpr` symbols may be reset using 15024the ``.set <symbol>, <expression>`` directive. 
For example, in the case of two
kernels, where ``func1`` is only called from ``kern1``, it is sufficient to
group the function with the kernel that calls it and reset the symbols
between the two connected components:

.. code::
  :number-lines:

  .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional

  // gpr tracking symbols are implicitly set to zero

  .text
  .globl kern0
  .p2align 8
  .type kern0,@function
  kern0:
    // ...
    s_endpgm
  .Lkern0_end:
    .size kern0, .Lkern0_end-kern0

  .rodata
  .p2align 6
  .amdhsa_kernel kern0
    // ...
    .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
    .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
  .end_amdhsa_kernel

  // reset symbols to begin tracking usage in func1 and kern1
  .set .amdgcn.next_free_vgpr, 0
  .set .amdgcn.next_free_sgpr, 0

  .text
  .hidden func1
  .global func1
  .p2align 2
  .type func1,@function
  func1:
    // ...
    s_setpc_b64 s[30:31]
  .Lfunc1_end:
    .size func1, .Lfunc1_end-func1

  .globl kern1
  .p2align 8
  .type kern1,@function
  kern1:
    // ...
    s_getpc_b64 s[4:5]
    s_add_u32 s4, s4, func1@rel32@lo+4
    s_addc_u32 s5, s5, func1@rel32@hi+4
    s_swappc_b64 s[30:31], s[4:5]
    // ...
    s_endpgm
  .Lkern1_end:
    .size kern1, .Lkern1_end-kern1

  .rodata
  .p2align 6
  .amdhsa_kernel kern1
    // ...
    .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
    .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
  .end_amdhsa_kernel

These symbols cannot identify connected components, so they cannot
automatically track the usage for each kernel.
However, in some cases careful organization of 15093the kernels and functions in the source file means there is minimal additional 15094effort required to accurately calculate GPR usage. 15095 15096Additional Documentation 15097======================== 15098 15099.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__ 15100.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_ 15101.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__ 15102.. [AMD-GCN-GFX900-GFX904-VEGA] `AMD Vega Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__ 15103.. [AMD-GCN-GFX906-VEGA7NM] `AMD Vega 7nm Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/11/Vega_7nm_Shader_ISA_26November2019.pdf>`__ 15104.. [AMD-GCN-GFX908-CDNA1] `AMD Instinct MI100 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/CDNA1_Shader_ISA_14December2020.pdf>`__ 15105.. [AMD-GCN-GFX90A-CDNA2] `AMD Instinct MI200 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/CDNA2_Shader_ISA_4February2022.pdf>`__ 15106.. [AMD-GCN-GFX10-RDNA1] `AMD RDNA 1.0 Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf>`__ 15107.. [AMD-GCN-GFX10-RDNA2] `AMD RDNA 2 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf>`__ 15108.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__ 15109.. 
[AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__ 15110.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__ 15111.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__ 15112.. [AMD-ROCm] `AMD ROCm™ Platform <https://rocmdocs.amd.com/>`__ 15113.. [AMD-ROCm-github] `AMD ROCm™ github <http://github.com/RadeonOpenCompute>`__ 15114.. [AMD-ROCm-Release-Notes] `AMD ROCm Release Notes <https://github.com/RadeonOpenCompute/ROCm>`__ 15115.. [CLANG-ATTR] `Attributes in Clang <https://clang.llvm.org/docs/AttributeReference.html>`__ 15116.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__ 15117.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__ 15118.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__ 15119.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__ 15120.. [MsgPack] `Message Pack <http://www.msgpack.org/>`__ 15121.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__ 15122.. [SEMVER] `Semantic Versioning <https://semver.org/>`__ 15123.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__ 15124