1============================= 2User Guide for AMDGPU Backend 3============================= 4 5.. contents:: 6 :local: 7 8Introduction 9============ 10 11The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the 12R600 family up until the current GCN families. It lives in the 13``lib/Target/AMDGPU`` directory. 14 15LLVM 16==== 17 18.. _amdgpu-target-triples: 19 20Target Triples 21-------------- 22 23Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to 24specify the target triple: 25 26 .. table:: AMDGPU Architectures 27 :name: amdgpu-architecture-table 28 29 ============ ============================================================== 30 Architecture Description 31 ============ ============================================================== 32 ``r600`` AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders. 33 ``amdgcn`` AMD GPUs GCN GFX6 onwards for graphics and compute shaders. 34 ============ ============================================================== 35 36 .. table:: AMDGPU Vendors 37 :name: amdgpu-vendor-table 38 39 ============ ============================================================== 40 Vendor Description 41 ============ ============================================================== 42 ``amd`` Can be used for all AMD GPU usage. 43 ``mesa3d`` Can be used if the OS is ``mesa3d``. 44 ============ ============================================================== 45 46 .. table:: AMDGPU Operating Systems 47 :name: amdgpu-os-table 48 49 ============== ============================================================ 50 OS Description 51 ============== ============================================================ 52 *<empty>* Defaults to the *unknown* OS. 53 ``amdhsa`` Compute kernels executed on HSA [HSA]_ compatible runtimes 54 such as AMD's ROCm [AMD-ROCm]_. 55 ``amdpal`` Graphic shaders and compute kernels executed on AMD PAL 56 runtime. 57 ``mesa3d`` Graphic shaders and compute kernels executed on Mesa 3D 58 runtime. 59 ============== ============================================================ 60 61 .. table:: AMDGPU Environments 62 :name: amdgpu-environment-table 63 64 ============ ============================================================== 65 Environment Description 66 ============ ============================================================== 67 *<empty>* Defaults to ``opencl``. 68 ``opencl`` OpenCL compute kernel (see :ref:`amdgpu-opencl`). 69 ``amdgizcl`` Same as ``opencl`` except a different address space mapping is 70 used (see :ref:`amdgpu-address-spaces`). 71 ``amdgiz`` Same as ``opencl`` except a different address space mapping is 72 used (see :ref:`amdgpu-address-spaces`). 73 ``hcc`` AMD HC language compute kernel (see :ref:`amdgpu-hcc`). 74 ============ ============================================================== 75 76.. _amdgpu-processors: 77 78Processors 79---------- 80 81Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The 82names from both the *Processor* and *Alternative Processor* can be used. 83 84 .. table:: AMDGPU Processors 85 :name: amdgpu-processor-table 86 87 =========== =============== ============ ===== ========= ======= ================== 88 Processor Alternative Target dGPU/ Target ROCm Example 89 Processor Triple APU Features Support Products 90 Architecture Supported 91 [Default] 92 =========== =============== ============ ===== ========= ======= ================== 93 **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_ 94 ----------------------------------------------------------------------------------- 95 ``r600`` ``r600`` dGPU 96 ``r630`` ``r600`` dGPU 97 ``rs880`` ``r600`` dGPU 98 ``rv670`` ``r600`` dGPU 99 **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_ 100 ----------------------------------------------------------------------------------- 101 ``rv710`` ``r600`` dGPU 102 ``rv730`` ``r600`` dGPU 103 ``rv770`` ``r600`` dGPU 104 **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_ 105 ----------------------------------------------------------------------------------- 106 ``cedar`` ``r600`` dGPU 107 ``redwood`` ``r600`` dGPU 108 ``sumo`` ``r600`` dGPU 109 ``juniper`` ``r600`` dGPU 110 ``cypress`` ``r600`` dGPU 111 **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_ 112 ----------------------------------------------------------------------------------- 113 ``barts`` ``r600`` dGPU 114 ``turks`` ``r600`` dGPU 115 ``caicos`` ``r600`` dGPU 116 ``cayman`` ``r600`` dGPU 117 **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_ 118 ----------------------------------------------------------------------------------- 119 ``gfx600`` - ``tahiti`` ``amdgcn`` dGPU 120 ``gfx601`` - ``pitcairn`` ``amdgcn`` dGPU 121 - ``verde`` 122 - ``oland`` 123 - ``hainan`` 124 **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_ 125 ----------------------------------------------------------------------------------- 126 ``gfx700`` - ``kaveri`` ``amdgcn`` APU - A6-7000 127 - A6 Pro-7050B 128 - A8-7100 129 - A8 Pro-7150B 130 - A10-7300 131 - A10 Pro-7350B 132 - FX-7500 133 - A8-7200P 134 - A10-7400P 135 - FX-7600P 136 ``gfx701`` - ``hawaii`` ``amdgcn`` dGPU ROCm - FirePro W8100 137 - FirePro W9100 138 - FirePro S9150 139 - FirePro S9170 140 ``gfx702`` ``amdgcn`` dGPU ROCm - Radeon R9 290 141 - Radeon R9 290x 142 - Radeon R390 143 - Radeon R390x 144 ``gfx703`` - ``kabini`` ``amdgcn`` APU - E1-2100 145 - ``mullins`` - E1-2200 146 - E1-2500 147 - E2-3000 148 - E2-3800 149 - A4-5000 150 - A4-5100 151 - A6-5200 152 - A4 Pro-3340B 153 ``gfx704`` - ``bonaire`` ``amdgcn`` dGPU - Radeon HD 7790 154 - Radeon HD 8770 155 - R7 260 156 - R7 260X 157 **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_ 158 ----------------------------------------------------------------------------------- 159 ``gfx801`` - ``carrizo`` ``amdgcn`` APU - xnack - A6-8500P 160 [on] - Pro A6-8500B 161 - A8-8600P 162 - Pro A8-8600B 163 - FX-8800P 164 - Pro A12-8800B 165 \ ``amdgcn`` APU - xnack ROCm - A10-8700P 166 [on] - Pro A10-8700B 167 - A10-8780P 168 \ ``amdgcn`` APU - xnack - A10-9600P 169 [on] - A10-9630P 170 - A12-9700P 171 - A12-9730P 172 - FX-9800P 173 - FX-9830P 174 \ ``amdgcn`` APU - xnack - E2-9010 175 [on] - A6-9210 176 - A9-9410 177 ``gfx802`` - ``tonga`` ``amdgcn`` dGPU - xnack ROCm - FirePro S7150 178 - ``iceland`` [off] - FirePro S7100 179 - FirePro W7100 180 - Radeon R285 181 - Radeon R9 380 182 - Radeon R9 385 183 - Mobile FirePro 184 M7170 185 ``gfx803`` - ``fiji`` ``amdgcn`` dGPU - xnack ROCm - Radeon R9 Nano 186 [off] - Radeon R9 Fury 187 - Radeon R9 FuryX 188 - Radeon Pro Duo 189 - FirePro S9300x2 190 - Radeon Instinct MI8 191 \ - ``polaris10`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 470 192 [off] - Radeon RX 480 193 - Radeon Instinct MI6 194 \ - ``polaris11`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 460 195 [off] 196 ``gfx810`` - ``stoney`` ``amdgcn`` APU - xnack 197 [on] 198 **GCN GFX9** [AMD-GCN-GFX9]_ 199 ----------------------------------------------------------------------------------- 200 ``gfx900`` ``amdgcn`` dGPU - xnack ROCm - Radeon Vega 201 [off] Frontier Edition 202 - Radeon RX Vega 56 203 - Radeon RX Vega 64 204 - Radeon RX Vega 64 205 Liquid 206 - Radeon Instinct MI25 207 ``gfx902`` ``amdgcn`` APU - xnack *TBA* 208 [on] 209 .. TODO 210 Add product 211 names. 212 =========== =============== ============ ===== ========= ======= ================== 213 214.. _amdgpu-target-features: 215 216Target Features 217--------------- 218 219Target features control how code is generated to support certain 220processor specific features. Not all target features are supported by 221all processors. The runtime must ensure that the features supported by 222the device used to execute the code match the features enabled when 223generating the code. A mismatch of features may result in incorrect 224execution, or a reduction in performance. 225 226The target features supported by each processor, and the default value 227used if not specified explicitly, is listed in 228:ref:`amdgpu-processor-table`. 229 230Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMD GPU 231target features. 232 233For example: 234 235``-mxnack`` 236 Enable the ``xnack`` feature. 237``-mno-xnack`` 238 Disable the ``xnack`` feature. 239 240 .. table:: AMDGPU Target Features 241 :name: amdgpu-target-feature-table 242 243 ============== ================================================== 244 Target Feature Description 245 ============== ================================================== 246 -m[no-]xnack Enable/disable generating code that has 247 memory clauses that are compatible with 248 having XNACK replay enabled. 249 250 This is used for demand paging and page 251 migration. If XNACK replay is enabled in 252 the device, then if a page fault occurs 253 the code may execute incorrectly if the 254 ``xnack`` feature is not enabled. Executing 255 code that has the feature enabled on a 256 device that does not have XNACK replay 257 enabled will execute correctly, but may 258 be less performant than code with the 259 feature disabled. 260 ============== ================================================== 261 262.. _amdgpu-address-spaces: 263 264Address Spaces 265-------------- 266 267The AMDGPU backend uses the following address space mappings. 268 269The memory space names used in the table, aside from the region memory space, is 270from the OpenCL standard. 271 272LLVM Address Space number is used throughout LLVM (for example, in LLVM IR). 273 274 .. table:: Address Space Mapping 275 :name: amdgpu-address-space-mapping-table 276 277 ================== ================= ================= ================= ================= 278 LLVM Address Space Memory Space 279 ------------------ ----------------------------------------------------------------------- 280 \ Current Default amdgiz/amdgizcl hcc Future Default 281 ================== ================= ================= ================= ================= 282 0 Private (Scratch) Generic (Flat) Generic (Flat) Generic (Flat) 283 1 Global Global Global Global 284 2 Constant Constant Constant Region (GDS) 285 3 Local (group/LDS) Local (group/LDS) Local (group/LDS) Local (group/LDS) 286 4 Generic (Flat) Region (GDS) Region (GDS) Constant 287 5 Region (GDS) Private (Scratch) Private (Scratch) Private (Scratch) 288 ================== ================= ================= ================= ================= 289 290Current Default 291 This is the current default address space mapping used for all languages 292 except hcc. This will shortly be deprecated. 293 294amdgiz/amdgizcl 295 This is the current address space mapping used when ``amdgiz`` or ``amdgizcl`` 296 is specified as the target triple environment value. 297 298hcc 299 This is the current address space mapping used when ``hcc`` is specified as 300 the target triple environment value.This will shortly be deprecated. 301 302Future Default 303 This will shortly be the only address space mapping for all languages using 304 AMDGPU backend. 305 306.. _amdgpu-memory-scopes: 307 308Memory Scopes 309------------- 310 311This section provides LLVM memory synchronization scopes supported by the AMDGPU 312backend memory model when the target triple OS is ``amdhsa`` (see 313:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`). 314 315The memory model supported is based on the HSA memory model [HSA]_ which is 316based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before 317relation is transitive over the synchonizes-with relation independent of scope, 318and synchonizes-with allows the memory scope instances to be inclusive (see 319table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`). 320 321This is different to the OpenCL [OpenCL]_ memory model which does not have scope 322inclusion and requires the memory scopes to exactly match. However, this 323is conservatively correct for OpenCL. 324 325 .. table:: AMDHSA LLVM Sync Scopes 326 :name: amdgpu-amdhsa-llvm-sync-scopes-table 327 328 ================ ========================================================== 329 LLVM Sync Scope Description 330 ================ ========================================================== 331 *none* The default: ``system``. 332 333 Synchronizes with, and participates in modification and 334 seq_cst total orderings with, other operations (except 335 image operations) for all address spaces (except private, 336 or generic that accesses private) provided the other 337 operation's sync scope is: 338 339 - ``system``. 340 - ``agent`` and executed by a thread on the same agent. 341 - ``workgroup`` and executed by a thread in the same 342 workgroup. 343 - ``wavefront`` and executed by a thread in the same 344 wavefront. 345 346 ``agent`` Synchronizes with, and participates in modification and 347 seq_cst total orderings with, other operations (except 348 image operations) for all address spaces (except private, 349 or generic that accesses private) provided the other 350 operation's sync scope is: 351 352 - ``system`` or ``agent`` and executed by a thread on the 353 same agent. 354 - ``workgroup`` and executed by a thread in the same 355 workgroup. 356 - ``wavefront`` and executed by a thread in the same 357 wavefront. 358 359 ``workgroup`` Synchronizes with, and participates in modification and 360 seq_cst total orderings with, other operations (except 361 image operations) for all address spaces (except private, 362 or generic that accesses private) provided the other 363 operation's sync scope is: 364 365 - ``system``, ``agent`` or ``workgroup`` and executed by a 366 thread in the same workgroup. 367 - ``wavefront`` and executed by a thread in the same 368 wavefront. 369 370 ``wavefront`` Synchronizes with, and participates in modification and 371 seq_cst total orderings with, other operations (except 372 image operations) for all address spaces (except private, 373 or generic that accesses private) provided the other 374 operation's sync scope is: 375 376 - ``system``, ``agent``, ``workgroup`` or ``wavefront`` 377 and executed by a thread in the same wavefront. 378 379 ``singlethread`` Only synchronizes with, and participates in modification 380 and seq_cst total orderings with, other operations (except 381 image operations) running in the same thread for all 382 address spaces (for example, in signal handlers). 383 ================ ========================================================== 384 385AMDGPU Intrinsics 386----------------- 387 388The AMDGPU backend implements the following intrinsics. 389 390*This section is WIP.* 391 392.. TODO 393 List AMDGPU intrinsics 394 395Code Object 396=========== 397 398The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that 399can be linked by ``lld`` to produce a standard ELF shared code object which can 400be loaded and executed on an AMDGPU target. 401 402Header 403------ 404 405The AMDGPU backend uses the following ELF header: 406 407 .. table:: AMDGPU ELF Header 408 :name: amdgpu-elf-header-table 409 410 ========================== =============================== 411 Field Value 412 ========================== =============================== 413 ``e_ident[EI_CLASS]`` ``ELFCLASS64`` 414 ``e_ident[EI_DATA]`` ``ELFDATA2LSB`` 415 ``e_ident[EI_OSABI]`` - ``ELFOSABI_NONE`` 416 - ``ELFOSABI_AMDGPU_HSA`` 417 - ``ELFOSABI_AMDGPU_PAL`` 418 - ``ELFOSABI_AMDGPU_MESA3D`` 419 ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA`` 420 - ``ELFABIVERSION_AMDGPU_PAL`` 421 - ``ELFABIVERSION_AMDGPU_MESA3D`` 422 ``e_type`` - ``ET_REL`` 423 - ``ET_DYN`` 424 ``e_machine`` ``EM_AMDGPU`` 425 ``e_entry`` 0 426 ``e_flags`` See :ref:`amdgpu-elf-header-e_flags-table` 427 ========================== =============================== 428 429.. 430 431 .. table:: AMDGPU ELF Header Enumeration Values 432 :name: amdgpu-elf-header-enumeration-values-table 433 434 =============================== ===== 435 Name Value 436 =============================== ===== 437 ``EM_AMDGPU`` 224 438 ``ELFOSABI_NONE`` 0 439 ``ELFOSABI_AMDGPU_HSA`` 64 440 ``ELFOSABI_AMDGPU_PAL`` 65 441 ``ELFOSABI_AMDGPU_MESA3D`` 66 442 ``ELFABIVERSION_AMDGPU_HSA`` 1 443 ``ELFABIVERSION_AMDGPU_PAL`` 0 444 ``ELFABIVERSION_AMDGPU_MESA3D`` 0 445 =============================== ===== 446 447``e_ident[EI_CLASS]`` 448 The ELF class is: 449 450 * ``ELFCLASS32`` for ``r600`` architecture. 451 452 * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64 453 bit applications. 454 455``e_ident[EI_DATA]`` 456 All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering. 457 458``e_ident[EI_OSABI]`` 459 One of the following AMD GPU architecture specific OS ABIs 460 (see :ref:`amdgpu-os-table`): 461 462 * ``ELFOSABI_NONE`` for *unknown* OS. 463 464 * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS. 465 466 * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS. 467 468 * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3D`` OS. 469 470``e_ident[EI_ABIVERSION]`` 471 The ABI version of the AMD GPU architecture specific OS ABI to which the code 472 object conforms: 473 474 * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA 475 runtime ABI. 476 477 * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL 478 runtime ABI. 479 480 * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA 481 3D runtime ABI. 482 483``e_type`` 484 Can be one of the following values: 485 486 487 ``ET_REL`` 488 The type produced by the AMD GPU backend compiler as it is relocatable code 489 object. 490 491 ``ET_DYN`` 492 The type produced by the linker as it is a shared code object. 493 494 The AMD HSA runtime loader requires a ``ET_DYN`` code object. 495 496``e_machine`` 497 The value ``EM_AMDGPU`` is used for the machine for all processors supported 498 by the ``r600`` and ``amdgcn`` architectures (see 499 :ref:`amdgpu-processor-table`). The specific processor is specified in the 500 ``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see 501 :ref:`amdgpu-elf-header-e_flags-table`). 502 503``e_entry`` 504 The entry point is 0 as the entry points for individual kernels must be 505 selected in order to invoke them through AQL packets. 506 507``e_flags`` 508 The AMDGPU backend uses the following ELF header flags: 509 510 .. table:: AMDGPU ELF Header ``e_flags`` 511 :name: amdgpu-elf-header-e_flags-table 512 513 ================================= ========== ============================= 514 Name Value Description 515 ================================= ========== ============================= 516 **AMDGPU Processor Flag** See :ref:`amdgpu-processor-table`. 517 -------------------------------------------- ----------------------------- 518 ``EF_AMDGPU_MACH`` 0x000000ff AMDGPU processor selection 519 mask for 520 ``EF_AMDGPU_MACH_xxx`` values 521 defined in 522 :ref:`amdgpu-ef-amdgpu-mach-table`. 523 ``EF_AMDGPU_XNACK`` 0x00000100 Indicates if the ``xnack`` 524 target feature is 525 enabled for all code 526 contained in the code object. 527 See 528 :ref:`amdgpu-target-features`. 529 ================================= ========== ============================= 530 531 .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values 532 :name: amdgpu-ef-amdgpu-mach-table 533 534 ================================= ========== ============================= 535 Name Value Description (see 536 :ref:`amdgpu-processor-table`) 537 ================================= ========== ============================= 538 ``EF_AMDGPU_MACH_NONE`` 0 *not specified* 539 ``EF_AMDGPU_MACH_R600_R600`` 1 ``r600`` 540 ``EF_AMDGPU_MACH_R600_R630`` 2 ``r630`` 541 ``EF_AMDGPU_MACH_R600_RS880`` 3 ``rs880`` 542 ``EF_AMDGPU_MACH_R600_RV670`` 4 ``rv670`` 543 ``EF_AMDGPU_MACH_R600_RV710`` 5 ``rv710`` 544 ``EF_AMDGPU_MACH_R600_RV730`` 6 ``rv730`` 545 ``EF_AMDGPU_MACH_R600_RV770`` 7 ``rv770`` 546 ``EF_AMDGPU_MACH_R600_CEDAR`` 8 ``cedar`` 547 ``EF_AMDGPU_MACH_R600_REDWOOD`` 9 ``redwood`` 548 ``EF_AMDGPU_MACH_R600_SUMO`` 10 ``sumo`` 549 ``EF_AMDGPU_MACH_R600_JUNIPER`` 11 ``juniper`` 550 ``EF_AMDGPU_MACH_R600_CYPRESS`` 12 ``cypress`` 551 ``EF_AMDGPU_MACH_R600_BARTS`` 13 ``barts`` 552 ``EF_AMDGPU_MACH_R600_TURKS`` 14 ``turks`` 553 ``EF_AMDGPU_MACH_R600_CAICOS`` 15 ``caicos`` 554 ``EF_AMDGPU_MACH_R600_CAYMAN`` 16 ``cayman`` 555 *reserved* 17-31 Reserved for ``r600`` 556 architecture processors. 557 ``EF_AMDGPU_MACH_AMDGCN_GFX600`` 32 ``gfx600`` 558 ``EF_AMDGPU_MACH_AMDGCN_GFX601`` 33 ``gfx601`` 559 ``EF_AMDGPU_MACH_AMDGCN_GFX700`` 34 ``gfx700`` 560 ``EF_AMDGPU_MACH_AMDGCN_GFX701`` 35 ``gfx701`` 561 ``EF_AMDGPU_MACH_AMDGCN_GFX702`` 36 ``gfx702`` 562 ``EF_AMDGPU_MACH_AMDGCN_GFX703`` 37 ``gfx703`` 563 ``EF_AMDGPU_MACH_AMDGCN_GFX704`` 38 ``gfx704`` 564 ``EF_AMDGPU_MACH_AMDGCN_GFX801`` 39 ``gfx801`` 565 ``EF_AMDGPU_MACH_AMDGCN_GFX802`` 40 ``gfx802`` 566 ``EF_AMDGPU_MACH_AMDGCN_GFX803`` 41 ``gfx803`` 567 ``EF_AMDGPU_MACH_AMDGCN_GFX810`` 42 ``gfx810`` 568 ``EF_AMDGPU_MACH_AMDGCN_GFX900`` 43 ``gfx900`` 569 ``EF_AMDGPU_MACH_AMDGCN_GFX902`` 44 ``gfx902`` 570 ================================= ========== ============================= 571 572Sections 573-------- 574 575An AMDGPU target ELF code object has the standard ELF sections which include: 576 577 .. table:: AMDGPU ELF Sections 578 :name: amdgpu-elf-sections-table 579 580 ================== ================ ================================= 581 Name Type Attributes 582 ================== ================ ================================= 583 ``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE`` 584 ``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE`` 585 ``.debug_``\ *\** ``SHT_PROGBITS`` *none* 586 ``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC`` 587 ``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC`` 588 ``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC`` 589 ``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE`` 590 ``.hash`` ``SHT_HASH`` ``SHF_ALLOC`` 591 ``.note`` ``SHT_NOTE`` *none* 592 ``.rela``\ *name* ``SHT_RELA`` *none* 593 ``.rela.dyn`` ``SHT_RELA`` *none* 594 ``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC`` 595 ``.shstrtab`` ``SHT_STRTAB`` *none* 596 ``.strtab`` ``SHT_STRTAB`` *none* 597 ``.symtab`` ``SHT_SYMTAB`` *none* 598 ``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR`` 599 ================== ================ ================================= 600 601These sections have their standard meanings (see [ELF]_) and are only generated 602if needed. 603 604``.debug``\ *\** 605 The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the 606 DWARF produced by the AMDGPU backend. 607 608``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash`` 609 The standard sections used by a dynamic loader. 610 611``.note`` 612 See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU 613 backend. 614 615``.rela``\ *name*, ``.rela.dyn`` 616 For relocatable code objects, *name* is the name of the section that the 617 relocation records apply. For example, ``.rela.text`` is the section name for 618 relocation records associated with the ``.text`` section. 619 620 For linked shared code objects, ``.rela.dyn`` contains all the relocation 621 records from each of the relocatable code object's ``.rela``\ *name* sections. 622 623 See :ref:`amdgpu-relocation-records` for the relocation records supported by 624 the AMDGPU backend. 625 626``.text`` 627 The executable machine code for the kernels and functions they call. Generated 628 as position independent code. See :ref:`amdgpu-code-conventions` for 629 information on conventions used in the isa generation. 630 631.. _amdgpu-note-records: 632 633Note Records 634------------ 635 636As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding must 637be generated after the ``name`` field to ensure the ``desc`` field is 4 byte 638aligned. In addition, minimal zero byte padding must be generated to ensure the 639``desc`` field size is a multiple of 4 bytes. The ``sh_addralign`` field of the 640``.note`` section must be at least 4 to indicate at least 8 byte alignment. 641 642The AMDGPU backend code object uses the following ELF note records in the 643``.note`` section. The *Description* column specifies the layout of the note 644record's ``desc`` field. All fields are consecutive bytes. Note records with 645variable size strings have a corresponding ``*_size`` field that specifies the 646number of bytes, including the terminating null character, in the string. The 647string(s) come immediately after the preceding fields. 648 649Additional note records can be present. 650 651 .. table:: AMDGPU ELF Note Records 652 :name: amdgpu-elf-note-records-table 653 654 ===== ============================== ====================================== 655 Name Type Description 656 ===== ============================== ====================================== 657 "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string> 658 ===== ============================== ====================================== 659 660.. 661 662 .. table:: AMDGPU ELF Note Record Enumeration Values 663 :name: amdgpu-elf-note-record-enumeration-values-table 664 665 ============================== ===== 666 Name Value 667 ============================== ===== 668 *reserved* 0-9 669 ``NT_AMD_AMDGPU_HSA_METADATA`` 10 670 *reserved* 11 671 ============================== ===== 672 673``NT_AMD_AMDGPU_HSA_METADATA`` 674 Specifies extensible metadata associated with the code objects executed on HSA 675 [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when 676 the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See 677 :ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code 678 object metadata string. 679 680.. _amdgpu-symbols: 681 682Symbols 683------- 684 685Symbols include the following: 686 687 .. table:: AMDGPU ELF Symbols 688 :name: amdgpu-elf-symbols-table 689 690 ===================== ============== ============= ================== 691 Name Type Section Description 692 ===================== ============== ============= ================== 693 *link-name* ``STT_OBJECT`` - ``.data`` Global variable 694 - ``.rodata`` 695 - ``.bss`` 696 *link-name*\ ``@kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor 697 *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point 698 ===================== ============== ============= ================== 699 700Global variable 701 Global variables both used and defined by the compilation unit. 702 703 If the symbol is defined in the compilation unit then it is allocated in the 704 appropriate section according to if it has initialized data or is readonly. 705 706 If the symbol is external then its section is ``STN_UNDEF`` and the loader 707 will resolve relocations using the definition provided by another code object 708 or explicitly defined by the runtime. 709 710 All global symbols, whether defined in the compilation unit or external, are 711 accessed by the machine code indirectly through a GOT table entry. This 712 allows them to be preemptable. The GOT table is only supported when the target 713 triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). 714 715 .. TODO 716 Add description of linked shared object symbols. Seems undefined symbols 717 are marked as STT_NOTYPE. 718 719Kernel descriptor 720 Every HSA kernel has an associated kernel descriptor. It is the address of the 721 kernel descriptor that is used in the AQL dispatch packet used to invoke the 722 kernel, not the kernel entry point. The layout of the HSA kernel descriptor is 723 defined in :ref:`amdgpu-amdhsa-kernel-descriptor`. 724 725Kernel entry point 726 Every HSA kernel also has a symbol for its machine code entry point. 727 728.. _amdgpu-relocation-records: 729 730Relocation Records 731------------------ 732 733AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported 734relocatable fields are: 735 736``word32`` 737 This specifies a 32-bit field occupying 4 bytes with arbitrary byte 738 alignment. These values use the same byte order as other word values in the 739 AMD GPU architecture. 740 741``word64`` 742 This specifies a 64-bit field occupying 8 bytes with arbitrary byte 743 alignment. These values use the same byte order as other word values in the 744 AMD GPU architecture. 745 746Following notations are used for specifying relocation calculations: 747 748**A** 749 Represents the addend used to compute the value of the relocatable field. 750 751**G** 752 Represents the offset into the global offset table at which the relocation 753 entry's symbol will reside during execution. 754 755**GOT** 756 Represents the address of the global offset table. 757 758**P** 759 Represents the place (section offset for ``et_rel`` or address for ``et_dyn``) 760 of the storage unit being relocated (computed using ``r_offset``). 761 762**S** 763 Represents the value of the symbol whose index resides in the relocation 764 entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``. 765 766**B** 767 Represents the base address of a loaded executable or shared object which is 768 the difference between the ELF address and the actual load address. Relocations 769 using this are only valid in executable or shared objects. 770 771The following relocation types are supported: 772 773 .. table:: AMDGPU ELF Relocation Records 774 :name: amdgpu-elf-relocation-records-table 775 776 ========================== ===== ========== ============================== 777 Relocation Type Value Field Calculation 778 ========================== ===== ========== ============================== 779 ``R_AMDGPU_NONE`` 0 *none* *none* 780 ``R_AMDGPU_ABS32_LO`` 1 ``word32`` (S + A) & 0xFFFFFFFF 781 ``R_AMDGPU_ABS32_HI`` 2 ``word32`` (S + A) >> 32 782 ``R_AMDGPU_ABS64`` 3 ``word64`` S + A 783 ``R_AMDGPU_REL32`` 4 ``word32`` S + A - P 784 ``R_AMDGPU_REL64`` 5 ``word64`` S + A - P 785 ``R_AMDGPU_ABS32`` 6 ``word32`` S + A 786 ``R_AMDGPU_GOTPCREL`` 7 ``word32`` G + GOT + A - P 787 ``R_AMDGPU_GOTPCREL32_LO`` 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF 788 ``R_AMDGPU_GOTPCREL32_HI`` 9 ``word32`` (G + GOT + A - P) >> 32 789 ``R_AMDGPU_REL32_LO`` 10 ``word32`` (S + A - P) & 0xFFFFFFFF 790 ``R_AMDGPU_REL32_HI`` 11 ``word32`` (S + A - P) >> 32 791 *reserved* 12 792 ``R_AMDGPU_RELATIVE64`` 13 ``word64`` B + A 793 ========================== ===== ========== ============================== 794 795.. _amdgpu-dwarf: 796 797DWARF 798----- 799 800Standard DWARF [DWARF]_ Version 2 sections can be generated. These contain 801information that maps the code object executable code and data to the source 802language constructs. It can be used by tools such as debuggers and profilers. 803 804Address Space Mapping 805~~~~~~~~~~~~~~~~~~~~~ 806 807The following address space mapping is used: 808 809 .. table:: AMDGPU DWARF Address Space Mapping 810 :name: amdgpu-dwarf-address-space-mapping-table 811 812 =================== ================= 813 DWARF Address Space Memory Space 814 =================== ================= 815 1 Private (Scratch) 816 2 Local (group/LDS) 817 *omitted* Global 818 *omitted* Constant 819 *omitted* Generic (Flat) 820 *not supported* Region (GDS) 821 =================== ================= 822 823See :ref:`amdgpu-address-spaces` for information on the memory space terminology 824used in the table. 825 826An ``address_class`` attribute is generated on pointer type DIEs to specify the 827DWARF address space of the value of the pointer when it is in the *private* or 828*local* address space. Otherwise the attribute is omitted. 829 830An ``XDEREF`` operation is generated in location list expressions for variables 831that are allocated in the *private* and *local* address space. Otherwise no 832``XDREF`` is omitted. 833 834Register Mapping 835~~~~~~~~~~~~~~~~ 836 837*This section is WIP.* 838 839.. TODO 840 Define DWARF register enumeration. 841 842 If want to present a wavefront state then should expose vector registers as 843 64 wide (rather than per work-item view that LLVM uses). Either as separate 844 registers, or a 64x4 byte single register. In either case use a new LANE op 845 (akin to XDREF) to select the current lane usage in a location 846 expression. This would also allow scalar register spilling to vector register 847 lanes to be expressed (currently no debug information is being generated for 848 spilling). If choose a wide single register approach then use LANE in 849 conjunction with PIECE operation to select the dword part of the register for 850 the current lane. If the separate register approach then use LANE to select 851 the register. 852 853Source Text 854~~~~~~~~~~~ 855 856*This section is WIP.* 857 858.. TODO 859 DWARF extension to include runtime generated source text. 860 861.. _amdgpu-code-conventions: 862 863Code Conventions 864================ 865 866This section provides code conventions used for each supported target triple OS 867(see :ref:`amdgpu-target-triples`). 868 869AMDHSA 870------ 871 872This section provides code conventions used when the target triple OS is 873``amdhsa`` (see :ref:`amdgpu-target-triples`). 874 875.. _amdgpu-amdhsa-hsa-code-object-metadata: 876 877Code Object Metadata 878~~~~~~~~~~~~~~~~~~~~ 879 880The code object metadata specifies extensible metadata associated with the code 881objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm 882[AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record 883(see :ref:`amdgpu-note-records`) and is required when the target triple OS is 884``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum 885information necessary to support the ROCM kernel queries. For example, the 886segment sizes needed in a dispatch packet. In addition, a high level language 887runtime may require other information to be included. For example, the AMD 888OpenCL runtime records kernel argument information. 889 890The metadata is specified as a YAML formatted string (see [YAML]_ and 891:doc:`YamlIO`). 892 893.. TODO 894 Is the string null terminated? It probably should not if YAML allows it to 895 contain null characters, otherwise it should be. 896 897The metadata is represented as a single YAML document comprised of the mapping 898defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and 899referenced tables. 900 901For boolean values, the string values of ``false`` and ``true`` are used for 902false and true respectively. 903 904Additional information can be added to the mappings. To avoid conflicts, any 905non-AMD key names should be prefixed by "*vendor-name*.". 906 907 .. table:: AMDHSA Code Object Metadata Mapping 908 :name: amdgpu-amdhsa-code-object-metadata-mapping-table 909 910 ========== ============== ========= ======================================= 911 String Key Value Type Required? Description 912 ========== ============== ========= ======================================= 913 "Version" sequence of Required - The first integer is the major 914 2 integers version. Currently 1. 915 - The second integer is the minor 916 version. Currently 0. 917 "Printf" sequence of Each string is encoded information 918 strings about a printf function call. The 919 encoded information is organized as 920 fields separated by colon (':'): 921 922 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString`` 923 924 where: 925 926 ``ID`` 927 A 32 bit integer as a unique id for 928 each printf function call 929 930 ``N`` 931 A 32 bit integer equal to the number 932 of arguments of printf function call 933 minus 1 934 935 ``S[i]`` (where i = 0, 1, ... , N-1) 936 32 bit integers for the size in bytes 937 of the i-th FormatString argument of 938 the printf function call 939 940 FormatString 941 The format string passed to the 942 printf function call. 943 "Kernels" sequence of Required Sequence of the mappings for each 944 mapping kernel in the code object. See 945 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table` 946 for the definition of the mapping. 947 ========== ============== ========= ======================================= 948 949.. 950 951 .. table:: AMDHSA Code Object Kernel Metadata Mapping 952 :name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table 953 954 ================= ============== ========= ================================ 955 String Key Value Type Required? Description 956 ================= ============== ========= ================================ 957 "Name" string Required Source name of the kernel. 958 "SymbolName" string Required Name of the kernel 959 descriptor ELF symbol. 960 "Language" string Source language of the kernel. 961 Values include: 962 963 - "OpenCL C" 964 - "OpenCL C++" 965 - "HCC" 966 - "OpenMP" 967 968 "LanguageVersion" sequence of - The first integer is the major 969 2 integers version. 970 - The second integer is the 971 minor version. 972 "Attrs" mapping Mapping of kernel attributes. 973 See 974 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table` 975 for the mapping definition. 976 "Args" sequence of Sequence of mappings of the 977 mapping kernel arguments. See 978 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table` 979 for the definition of the mapping. 980 "CodeProps" mapping Mapping of properties related to 981 the kernel code. See 982 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table` 983 for the mapping definition. 984 ================= ============== ========= ================================ 985 986.. 987 988 .. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping 989 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table 990 991 =================== ============== ========= ============================== 992 String Key Value Type Required? Description 993 =================== ============== ========= ============================== 994 "ReqdWorkGroupSize" sequence of The dispatch work-group size 995 3 integers X, Y, Z must correspond to the 996 specified values. 997 998 Corresponds to the OpenCL 999 ``reqd_work_group_size`` 1000 attribute. 1001 "WorkGroupSizeHint" sequence of The dispatch work-group size 1002 3 integers X, Y, Z is likely to be the 1003 specified values. 1004 1005 Corresponds to the OpenCL 1006 ``work_group_size_hint`` 1007 attribute. 1008 "VecTypeHint" string The name of a scalar or vector 1009 type. 1010 1011 Corresponds to the OpenCL 1012 ``vec_type_hint`` attribute. 1013 1014 "RuntimeHandle" string The external symbol name 1015 associated with a kernel. 1016 OpenCL runtime allocates a 1017 global buffer for the symbol 1018 and saves the kernel's address 1019 to it, which is used for 1020 device side enqueueing. Only 1021 available for device side 1022 enqueued kernels. 1023 =================== ============== ========= ============================== 1024 1025.. 1026 1027 .. table:: AMDHSA Code Object Kernel Argument Metadata Mapping 1028 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table 1029 1030 ================= ============== ========= ================================ 1031 String Key Value Type Required? Description 1032 ================= ============== ========= ================================ 1033 "Name" string Kernel argument name. 1034 "TypeName" string Kernel argument type name. 1035 "Size" integer Required Kernel argument size in bytes. 1036 "Align" integer Required Kernel argument alignment in 1037 bytes. Must be a power of two. 1038 "ValueKind" string Required Kernel argument kind that 1039 specifies how to set up the 1040 corresponding argument. 1041 Values include: 1042 1043 "ByValue" 1044 The argument is copied 1045 directly into the kernarg. 1046 1047 "GlobalBuffer" 1048 A global address space pointer 1049 to the buffer data is passed 1050 in the kernarg. 1051 1052 "DynamicSharedPointer" 1053 A group address space pointer 1054 to dynamically allocated LDS 1055 is passed in the kernarg. 1056 1057 "Sampler" 1058 A global address space 1059 pointer to a S# is passed in 1060 the kernarg. 1061 1062 "Image" 1063 A global address space 1064 pointer to a T# is passed in 1065 the kernarg. 1066 1067 "Pipe" 1068 A global address space pointer 1069 to an OpenCL pipe is passed in 1070 the kernarg. 1071 1072 "Queue" 1073 A global address space pointer 1074 to an OpenCL device enqueue 1075 queue is passed in the 1076 kernarg. 1077 1078 "HiddenGlobalOffsetX" 1079 The OpenCL grid dispatch 1080 global offset for the X 1081 dimension is passed in the 1082 kernarg. 1083 1084 "HiddenGlobalOffsetY" 1085 The OpenCL grid dispatch 1086 global offset for the Y 1087 dimension is passed in the 1088 kernarg. 1089 1090 "HiddenGlobalOffsetZ" 1091 The OpenCL grid dispatch 1092 global offset for the Z 1093 dimension is passed in the 1094 kernarg. 1095 1096 "HiddenNone" 1097 An argument that is not used 1098 by the kernel. Space needs to 1099 be left for it, but it does 1100 not need to be set up. 1101 1102 "HiddenPrintfBuffer" 1103 A global address space pointer 1104 to the runtime printf buffer 1105 is passed in kernarg. 1106 1107 "HiddenDefaultQueue" 1108 A global address space pointer 1109 to the OpenCL device enqueue 1110 queue that should be used by 1111 the kernel by default is 1112 passed in the kernarg. 1113 1114 "HiddenCompletionAction" 1115 A global address space pointer 1116 to help link enqueued kernels into 1117 the ancestor tree for determining 1118 when the parent kernel has finished. 1119 1120 "ValueType" string Required Kernel argument value type. Only 1121 present if "ValueKind" is 1122 "ByValue". For vector data 1123 types, the value is for the 1124 element type. Values include: 1125 1126 - "Struct" 1127 - "I8" 1128 - "U8" 1129 - "I16" 1130 - "U16" 1131 - "F16" 1132 - "I32" 1133 - "U32" 1134 - "F32" 1135 - "I64" 1136 - "U64" 1137 - "F64" 1138 1139 .. TODO 1140 How can it be determined if a 1141 vector type, and what size 1142 vector? 1143 "PointeeAlign" integer Alignment in bytes of pointee 1144 type for pointer type kernel 1145 argument. Must be a power 1146 of 2. Only present if 1147 "ValueKind" is 1148 "DynamicSharedPointer". 1149 "AddrSpaceQual" string Kernel argument address space 1150 qualifier. Only present if 1151 "ValueKind" is "GlobalBuffer" or 1152 "DynamicSharedPointer". Values 1153 are: 1154 1155 - "Private" 1156 - "Global" 1157 - "Constant" 1158 - "Local" 1159 - "Generic" 1160 - "Region" 1161 1162 .. TODO 1163 Is GlobalBuffer only Global 1164 or Constant? Is 1165 DynamicSharedPointer always 1166 Local? Can HCC allow Generic? 1167 How can Private or Region 1168 ever happen? 1169 "AccQual" string Kernel argument access 1170 qualifier. Only present if 1171 "ValueKind" is "Image" or 1172 "Pipe". Values 1173 are: 1174 1175 - "ReadOnly" 1176 - "WriteOnly" 1177 - "ReadWrite" 1178 1179 .. TODO 1180 Does this apply to 1181 GlobalBuffer? 1182 "ActualAccQual" string The actual memory accesses 1183 performed by the kernel on the 1184 kernel argument. Only present if 1185 "ValueKind" is "GlobalBuffer", 1186 "Image", or "Pipe". This may be 1187 more restrictive than indicated 1188 by "AccQual" to reflect what the 1189 kernel actual does. If not 1190 present then the runtime must 1191 assume what is implied by 1192 "AccQual" and "IsConst". Values 1193 are: 1194 1195 - "ReadOnly" 1196 - "WriteOnly" 1197 - "ReadWrite" 1198 1199 "IsConst" boolean Indicates if the kernel argument 1200 is const qualified. Only present 1201 if "ValueKind" is 1202 "GlobalBuffer". 1203 1204 "IsRestrict" boolean Indicates if the kernel argument 1205 is restrict qualified. Only 1206 present if "ValueKind" is 1207 "GlobalBuffer". 1208 1209 "IsVolatile" boolean Indicates if the kernel argument 1210 is volatile qualified. Only 1211 present if "ValueKind" is 1212 "GlobalBuffer". 1213 1214 "IsPipe" boolean Indicates if the kernel argument 1215 is pipe qualified. Only present 1216 if "ValueKind" is "Pipe". 1217 1218 .. TODO 1219 Can GlobalBuffer be pipe 1220 qualified? 1221 ================= ============== ========= ================================ 1222 1223.. 1224 1225 .. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping 1226 :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table 1227 1228 ============================ ============== ========= ===================== 1229 String Key Value Type Required? Description 1230 ============================ ============== ========= ===================== 1231 "KernargSegmentSize" integer Required The size in bytes of 1232 the kernarg segment 1233 that holds the values 1234 of the arguments to 1235 the kernel. 1236 "GroupSegmentFixedSize" integer Required The amount of group 1237 segment memory 1238 required by a 1239 work-group in 1240 bytes. This does not 1241 include any 1242 dynamically allocated 1243 group segment memory 1244 that may be added 1245 when the kernel is 1246 dispatched. 1247 "PrivateSegmentFixedSize" integer Required The amount of fixed 1248 private address space 1249 memory required for a 1250 work-item in 1251 bytes. If the kernel 1252 uses a dynamic call 1253 stack then additional 1254 space must be added 1255 to this value for the 1256 call stack. 1257 "KernargSegmentAlign" integer Required The maximum byte 1258 alignment of 1259 arguments in the 1260 kernarg segment. Must 1261 be a power of 2. 1262 "WavefrontSize" integer Required Wavefront size. Must 1263 be a power of 2. 1264 "NumSGPRs" integer Required Number of scalar 1265 registers used by a 1266 wavefront for 1267 GFX6-GFX9. This 1268 includes the special 1269 SGPRs for VCC, Flat 1270 Scratch (GFX7-GFX9) 1271 and XNACK (for 1272 GFX8-GFX9). It does 1273 not include the 16 1274 SGPR added if a trap 1275 handler is 1276 enabled. It is not 1277 rounded up to the 1278 allocation 1279 granularity. 1280 "NumVGPRs" integer Required Number of vector 1281 registers used by 1282 each work-item for 1283 GFX6-GFX9 1284 "MaxFlatWorkGroupSize" integer Required Maximum flat 1285 work-group size 1286 supported by the 1287 kernel in work-items. 1288 Must be >=1 and 1289 consistent with any 1290 non-0 values in 1291 FixedWorkGroupSize. 1292 "FixedWorkGroupSize" sequence of Corresponds to the 1293 3 integers dispatch work-group 1294 size X, Y, Z. If 1295 omitted, defaults to 1296 0, 0, 0. If an 1297 element is non-0 then 1298 the kernel must only 1299 be launched with a 1300 matching corresponding 1301 work-group size. 1302 "NumSpilledSGPRs" integer Number of stores from 1303 a scalar register to 1304 a register allocator 1305 created spill 1306 location. 1307 "NumSpilledVGPRs" integer Number of stores from 1308 a vector register to 1309 a register allocator 1310 created spill 1311 location. 1312 ============================ ============== ========= ===================== 1313 1314.. 1315 1316Kernel Dispatch 1317~~~~~~~~~~~~~~~ 1318 1319The HSA architected queuing language (AQL) defines a user space memory interface 1320that can be used to control the dispatch of kernels, in an agent independent 1321way. An agent can have zero or more AQL queues created for it using the ROCm 1322runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the 1323*HSA Platform System Architecture Specification* [HSA]_ for the AQL queue 1324mechanics and packet layouts. 1325 1326The packet processor of a kernel agent is responsible for detecting and 1327dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the 1328packet processor is implemented by the hardware command processor (CP), 1329asynchronous dispatch controller (ADC) and shader processor input controller 1330(SPI). 1331 1332The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel 1333mode driver to initialize and register the AQL queue with CP. 1334 1335To dispatch a kernel the following actions are performed. This can occur in the 1336CPU host program, or from an HSA kernel executing on a GPU. 1337 13381. A pointer to an AQL queue for the kernel agent on which the kernel is to be 1339 executed is obtained. 13402. A pointer to the kernel descriptor (see 1341 :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is 1342 obtained. It must be for a kernel that is contained in a code object that that 1343 was loaded by the ROCm runtime on the kernel agent with which the AQL queue is 1344 associated. 13453. Space is allocated for the kernel arguments using the ROCm runtime allocator 1346 for a memory region with the kernarg property for the kernel agent that will 1347 execute the kernel. It must be at least 16 byte aligned. 13484. Kernel argument values are assigned to the kernel argument memory 1349 allocation. The layout is defined in the *HSA Programmer's Language Reference* 1350 [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument 1351 memory in the same way constant memory is accessed. (Note that the HSA 1352 specification allows an implementation to copy the kernel argument contents to 1353 another location that is accessed by the kernel.) 13545. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime 1355 api uses 64 bit atomic operations to reserve space in the AQL queue for the 1356 packet. The packet must be set up, and the final write must use an atomic 1357 store release to set the packet kind to ensure the packet contents are 1358 visible to the kernel agent. AQL defines a doorbell signal mechanism to 1359 notify the kernel agent that the AQL queue has been updated. These rules, and 1360 the layout of the AQL queue and kernel dispatch packet is defined in the *HSA 1361 System Architecture Specification* [HSA]_. 13626. A kernel dispatch packet includes information about the actual dispatch, 1363 such as grid and work-group size, together with information from the code 1364 object about the kernel, such as segment sizes. The ROCm runtime queries on 1365 the kernel symbol can be used to obtain the code object values which are 1366 recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`. 13677. CP executes micro-code and is responsible for detecting and setting up the 1368 GPU to execute the wavefronts of a kernel dispatch. 13698. CP ensures that when the a wavefront starts executing the kernel machine 1370 code, the scalar general purpose registers (SGPR) and vector general purpose 1371 registers (VGPR) are set up as required by the machine code. The required 1372 setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial 1373 register state is defined in 1374 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`. 13759. The prolog of the kernel machine code (see 1376 :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary 1377 before continuing executing the machine code that corresponds to the kernel. 137810. When the kernel dispatch has completed execution, CP signals the completion 1379 signal specified in the kernel dispatch packet if not 0. 1380 1381.. _amdgpu-amdhsa-memory-spaces: 1382 1383Memory Spaces 1384~~~~~~~~~~~~~ 1385 1386The memory space properties are: 1387 1388 .. table:: AMDHSA Memory Spaces 1389 :name: amdgpu-amdhsa-memory-spaces-table 1390 1391 ================= =========== ======== ======= ================== 1392 Memory Space Name HSA Segment Hardware Address NULL Value 1393 Name Name Size 1394 ================= =========== ======== ======= ================== 1395 Private private scratch 32 0x00000000 1396 Local group LDS 32 0xFFFFFFFF 1397 Global global global 64 0x0000000000000000 1398 Constant constant *same as 64 0x0000000000000000 1399 global* 1400 Generic flat flat 64 0x0000000000000000 1401 Region N/A GDS 32 *not implemented 1402 for AMDHSA* 1403 ================= =========== ======== ======= ================== 1404 1405The global and constant memory spaces both use global virtual addresses, which 1406are the same virtual address space used by the CPU. However, some virtual 1407addresses may only be accessible to the CPU, some only accessible by the GPU, 1408and some by both. 1409 1410Using the constant memory space indicates that the data will not change during 1411the execution of the kernel. This allows scalar read instructions to be 1412used. The vector and scalar L1 caches are invalidated of volatile data before 1413each kernel dispatch execution to allow constant memory to change values between 1414kernel dispatches. 1415 1416The local memory space uses the hardware Local Data Store (LDS) which is 1417automatically allocated when the hardware creates work-groups of wavefronts, and 1418freed when all the wavefronts of a work-group have terminated. The data store 1419(DS) instructions can be used to access it. 1420 1421The private memory space uses the hardware scratch memory support. If the kernel 1422uses scratch, then the hardware allocates memory that is accessed using 1423wavefront lane dword (4 byte) interleaving. The mapping used from private 1424address to physical address is: 1425 1426 ``wavefront-scratch-base + 1427 (private-address * wavefront-size * 4) + 1428 (wavefront-lane-id * 4)`` 1429 1430There are different ways that the wavefront scratch base address is determined 1431by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This 1432memory can be accessed in an interleaved manner using buffer instruction with 1433the scratch buffer descriptor and per wave scratch offset, by the scratch 1434instructions, or by flat instructions. If each lane of a wavefront accesses the 1435same private address, the interleaving results in adjacent dwords being accessed 1436and hence requires fewer cache lines to be fetched. Multi-dword access is not 1437supported except by flat and scratch instructions in GFX9. 1438 1439The generic address space uses the hardware flat address support available in 1440GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and 1441local appertures), that are outside the range of addressible global memory, to 1442map from a flat address to a private or local address. 1443 1444FLAT instructions can take a flat address and access global, private (scratch) 1445and group (LDS) memory depending in if the address is within one of the 1446apperture ranges. Flat access to scratch requires hardware aperture setup and 1447setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat 1448access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup 1449(see :ref:`amdgpu-amdhsa-m0`). 1450 1451To convert between a segment address and a flat address the base address of the 1452appertures address can be used. For GFX7-GFX8 these are available in the 1453:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with 1454Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For 1455GFX9 the appature base addresses are directly available as inline constant 1456registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit 1457address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32 1458which makes it easier to convert from flat to segment or segment to flat. 1459 1460Image and Samplers 1461~~~~~~~~~~~~~~~~~~ 1462 1463Image and sample handles created by the ROCm runtime are 64 bit addresses of a 1464hardware 32 byte V# and 48 byte S# object respectively. In order to support the 1465HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG 1466enumeration values for the queries that are not trivially deducible from the S# 1467representation. 1468 1469HSA Signals 1470~~~~~~~~~~~ 1471 1472HSA signal handles created by the ROCm runtime are 64 bit addresses of a 1473structure allocated in memory accessible from both the CPU and GPU. The 1474structure is defined by the ROCm runtime and subject to change between releases 1475(see [AMD-ROCm-github]_). 1476 1477.. _amdgpu-amdhsa-hsa-aql-queue: 1478 1479HSA AQL Queue 1480~~~~~~~~~~~~~ 1481 1482The HSA AQL queue structure is defined by the ROCm runtime and subject to change 1483between releases (see [AMD-ROCm-github]_). For some processors it contains 1484fields needed to implement certain language features such as the flat address 1485aperture bases. It also contains fields used by CP such as managing the 1486allocation of scratch memory. 1487 1488.. _amdgpu-amdhsa-kernel-descriptor: 1489 1490Kernel Descriptor 1491~~~~~~~~~~~~~~~~~ 1492 1493A kernel descriptor consists of the information needed by CP to initiate the 1494execution of a kernel, including the entry point address of the machine code 1495that implements the kernel. 1496 1497Kernel Descriptor for GFX6-GFX9 1498+++++++++++++++++++++++++++++++ 1499 1500CP microcode requires the Kernel descritor to be allocated on 64 byte alignment. 1501 1502 .. table:: Kernel Descriptor for GFX6-GFX9 1503 :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table 1504 1505 ======= ======= =============================== ============================ 1506 Bits Size Field Name Description 1507 ======= ======= =============================== ============================ 1508 31:0 4 bytes GroupSegmentFixedSize The amount of fixed local 1509 address space memory 1510 required for a work-group 1511 in bytes. This does not 1512 include any dynamically 1513 allocated local address 1514 space memory that may be 1515 added when the kernel is 1516 dispatched. 1517 63:32 4 bytes PrivateSegmentFixedSize The amount of fixed 1518 private address space 1519 memory required for a 1520 work-item in bytes. If 1521 is_dynamic_callstack is 1 1522 then additional space must 1523 be added to this value for 1524 the call stack. 1525 127:64 8 bytes Reserved, must be 0. 1526 191:128 8 bytes KernelCodeEntryByteOffset Byte offset (possibly 1527 negative) from base 1528 address of kernel 1529 descriptor to kernel's 1530 entry point instruction 1531 which must be 256 byte 1532 aligned. 1533 223:192 4 bytes MaxFlatWorkGroupSize Maximum flat work-group 1534 size supported by the 1535 kernel in work-items. If 1536 an exact work-group size 1537 is required then must be 1538 omitted or 0 and 1539 ReqdWorkGroupSize* must 1540 be set to non-0. 1541 239:224 2 bytes ReqdWorkGroupSizeX If present and non-0 then 1542 the kernel 1543 must be executed with the 1544 specified work-group size 1545 for X. 1546 255:240 2 bytes ReqdWorkGroupSizeY If present and non-0 then 1547 the kernel 1548 must be executed with the 1549 specified work-group size 1550 for Y. 1551 271:256 2 bytes ReqdWorkGroupSizeZ If present and non-0 then 1552 the kernel 1553 must be executed with the 1554 specified work-group size 1555 for Z. 1556 383:272 14 Reserved, must be 0. 1557 bytes 1558 415:384 4 bytes ComputePgmRsrc1 Compute Shader (CS) 1559 program settings used by 1560 CP to set up 1561 ``COMPUTE_PGM_RSRC1`` 1562 configuration 1563 register. See 1564 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`. 1565 447:416 4 bytes ComputePgmRsrc2 Compute Shader (CS) 1566 program settings used by 1567 CP to set up 1568 ``COMPUTE_PGM_RSRC2`` 1569 configuration 1570 register. See 1571 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`. 1572 448 1 bit EnableSGPRPrivateSegmentBuffer Enable the setup of the 1573 SGPR user data registers 1574 (see 1575 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 1576 1577 The total number of SGPR 1578 user data registers 1579 requested must not exceed 1580 16 and match value in 1581 ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``. 1582 Any requests beyond 16 1583 will be ignored. 1584 449 1 bit EnableSGPRDispatchPtr *see above* 1585 450 1 bit EnableSGPRQueuePtr *see above* 1586 451 1 bit EnableSGPRKernargSegmentPtr *see above* 1587 452 1 bit EnableSGPRDispatchID *see above* 1588 453 1 bit EnableSGPRFlatScratchInit *see above* 1589 454 1 bit EnableSGPRPrivateSegmentSize *see above* 1590 455 1 bit EnableSGPRGridWorkgroupCountX Not implemented in CP and 1591 should always be 0. 1592 456 1 bit EnableSGPRGridWorkgroupCountY Not implemented in CP and 1593 should always be 0. 1594 457 1 bit EnableSGPRGridWorkgroupCountZ Not implemented in CP and 1595 should always be 0. 1596 463:458 6 bits Reserved, must be 0. 1597 511:464 6 Reserved, must be 0. 1598 bytes 1599 512 **Total size 64 bytes.** 1600 ======= ==================================================================== 1601 1602.. 1603 1604 .. table:: compute_pgm_rsrc1 for GFX6-GFX9 1605 :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table 1606 1607 ======= ======= =============================== =========================================================================== 1608 Bits Size Field Name Description 1609 ======= ======= =============================== =========================================================================== 1610 5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector registers 1611 used by each work-item, 1612 granularity is device 1613 specific: 1614 1615 GFX6-GFX9 1616 - max_vgpr 1..256 1617 - roundup((max_vgpg + 1) 1618 / 4) - 1 1619 1620 Used by CP to set up 1621 ``COMPUTE_PGM_RSRC1.VGPRS``. 1622 9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar registers 1623 used by a wavefront, 1624 granularity is device 1625 specific: 1626 1627 GFX6-GFX8 1628 - max_sgpr 1..112 1629 - roundup((max_sgpg + 1) 1630 / 8) - 1 1631 GFX9 1632 - max_sgpr 1..112 1633 - roundup((max_sgpg + 1) 1634 / 16) - 1 1635 1636 Includes the special SGPRs 1637 for VCC, Flat Scratch (for 1638 GFX7 onwards) and XNACK 1639 (for GFX8 onwards). It does 1640 not include the 16 SGPR 1641 added if a trap handler is 1642 enabled. 1643 1644 Used by CP to set up 1645 ``COMPUTE_PGM_RSRC1.SGPRS``. 1646 11:10 2 bits PRIORITY Must be 0. 1647 1648 Start executing wavefront 1649 at the specified priority. 1650 1651 CP is responsible for 1652 filling in 1653 ``COMPUTE_PGM_RSRC1.PRIORITY``. 1654 13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution 1655 with specified rounding 1656 mode for single (32 1657 bit) floating point 1658 precision floating point 1659 operations. 1660 1661 Floating point rounding 1662 mode values are defined in 1663 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`. 1664 1665 Used by CP to set up 1666 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``. 1667 15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution 1668 with specified rounding 1669 denorm mode for half/double (16 1670 and 64 bit) floating point 1671 precision floating point 1672 operations. 1673 1674 Floating point rounding 1675 mode values are defined in 1676 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`. 1677 1678 Used by CP to set up 1679 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``. 1680 17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution 1681 with specified denorm mode 1682 for single (32 1683 bit) floating point 1684 precision floating point 1685 operations. 1686 1687 Floating point denorm mode 1688 values are defined in 1689 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`. 1690 1691 Used by CP to set up 1692 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``. 1693 19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution 1694 with specified denorm mode 1695 for half/double (16 1696 and 64 bit) floating point 1697 precision floating point 1698 operations. 1699 1700 Floating point denorm mode 1701 values are defined in 1702 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`. 1703 1704 Used by CP to set up 1705 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``. 1706 20 1 bit PRIV Must be 0. 1707 1708 Start executing wavefront 1709 in privilege trap handler 1710 mode. 1711 1712 CP is responsible for 1713 filling in 1714 ``COMPUTE_PGM_RSRC1.PRIV``. 1715 21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution 1716 with DX10 clamp mode 1717 enabled. Used by the vector 1718 ALU to force DX10 style 1719 treatment of NaN's (when 1720 set, clamp NaN to zero, 1721 otherwise pass NaN 1722 through). 1723 1724 Used by CP to set up 1725 ``COMPUTE_PGM_RSRC1.DX10_CLAMP``. 1726 22 1 bit DEBUG_MODE Must be 0. 1727 1728 Start executing wavefront 1729 in single step mode. 1730 1731 CP is responsible for 1732 filling in 1733 ``COMPUTE_PGM_RSRC1.DEBUG_MODE``. 1734 23 1 bit ENABLE_IEEE_MODE Wavefront starts execution 1735 with IEEE mode 1736 enabled. Floating point 1737 opcodes that support 1738 exception flag gathering 1739 will quiet and propagate 1740 signaling-NaN inputs per 1741 IEEE 754-2008. Min_dx10 and 1742 max_dx10 become IEEE 1743 754-2008 compliant due to 1744 signaling-NaN propagation 1745 and quieting. 1746 1747 Used by CP to set up 1748 ``COMPUTE_PGM_RSRC1.IEEE_MODE``. 1749 24 1 bit BULKY Must be 0. 1750 1751 Only one work-group allowed 1752 to execute on a compute 1753 unit. 1754 1755 CP is responsible for 1756 filling in 1757 ``COMPUTE_PGM_RSRC1.BULKY``. 1758 25 1 bit CDBG_USER Must be 0. 1759 1760 Flag that can be used to 1761 control debugging code. 1762 1763 CP is responsible for 1764 filling in 1765 ``COMPUTE_PGM_RSRC1.CDBG_USER``. 1766 26 1 bit FP16_OVFL GFX6-GFX8 1767 Reserved, must be 0. 1768 GFX9 1769 Wavefront starts execution 1770 with specified fp16 overflow 1771 mode. 1772 1773 - If 0, fp16 overflow generates 1774 +/-INF values. 1775 - If 1, fp16 overflow that is the 1776 result of an +/-INF input value 1777 or divide by 0 produces a +/-INF, 1778 otherwise clamps computed 1779 overflow to +/-MAX_FP16 as 1780 appropriate. 1781 1782 Used by CP to set up 1783 ``COMPUTE_PGM_RSRC1.FP16_OVFL``. 1784 31:27 5 bits Reserved, must be 0. 1785 32 **Total size 4 bytes** 1786 ======= =================================================================================================================== 1787 1788.. 1789 1790 .. table:: compute_pgm_rsrc2 for GFX6-GFX9 1791 :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table 1792 1793 ======= ======= =============================== =========================================================================== 1794 Bits Size Field Name Description 1795 ======= ======= =============================== =========================================================================== 1796 0 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the 1797 _WAVE_OFFSET SGPR wave scratch offset 1798 system register (see 1799 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 1800 1801 Used by CP to set up 1802 ``COMPUTE_PGM_RSRC2.SCRATCH_EN``. 1803 5:1 5 bits USER_SGPR_COUNT The total number of SGPR 1804 user data registers 1805 requested. This number must 1806 match the number of user 1807 data registers enabled. 1808 1809 Used by CP to set up 1810 ``COMPUTE_PGM_RSRC2.USER_SGPR``. 1811 6 1 bit ENABLE_TRAP_HANDLER Set to 1 if code contains a 1812 TRAP instruction which 1813 requires a trap handler to 1814 be enabled. 1815 1816 CP sets 1817 ``COMPUTE_PGM_RSRC2.TRAP_PRESENT`` 1818 if the runtime has 1819 installed a trap handler 1820 regardless of the setting 1821 of this field. 1822 7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the 1823 system SGPR register for 1824 the work-group id in the X 1825 dimension (see 1826 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 1827 1828 Used by CP to set up 1829 ``COMPUTE_PGM_RSRC2.TGID_X_EN``. 1830 8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the 1831 system SGPR register for 1832 the work-group id in the Y 1833 dimension (see 1834 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 1835 1836 Used by CP to set up 1837 ``COMPUTE_PGM_RSRC2.TGID_Y_EN``. 1838 9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the 1839 system SGPR register for 1840 the work-group id in the Z 1841 dimension (see 1842 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 1843 1844 Used by CP to set up 1845 ``COMPUTE_PGM_RSRC2.TGID_Z_EN``. 1846 10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the 1847 system SGPR register for 1848 work-group information (see 1849 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 1850 1851 Used by CP to set up 1852 ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``. 1853 12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the 1854 VGPR system registers used 1855 for the work-item ID. 1856 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table` 1857 defines the values. 1858 1859 Used by CP to set up 1860 ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``. 1861 13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0. 1862 1863 Wavefront starts execution 1864 with address watch 1865 exceptions enabled which 1866 are generated when L1 has 1867 witnessed a thread access 1868 an *address of 1869 interest*. 1870 1871 CP is responsible for 1872 filling in the address 1873 watch bit in 1874 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB`` 1875 according to what the 1876 runtime requests. 1877 14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0. 1878 1879 Wavefront starts execution 1880 with memory violation 1881 exceptions exceptions 1882 enabled which are generated 1883 when a memory violation has 1884 occurred for this wave from 1885 L1 or LDS 1886 (write-to-read-only-memory, 1887 mis-aligned atomic, LDS 1888 address out of range, 1889 illegal address, etc.). 1890 1891 CP sets the memory 1892 violation bit in 1893 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB`` 1894 according to what the 1895 runtime requests. 1896 23:15 9 bits GRANULATED_LDS_SIZE Must be 0. 1897 1898 CP uses the rounded value 1899 from the dispatch packet, 1900 not this value, as the 1901 dispatch may contain 1902 dynamically allocated group 1903 segment memory. CP writes 1904 directly to 1905 ``COMPUTE_PGM_RSRC2.LDS_SIZE``. 1906 1907 Amount of group segment 1908 (LDS) to allocate for each 1909 work-group. Granularity is 1910 device specific: 1911 1912 GFX6: 1913 roundup(lds-size / (64 * 4)) 1914 GFX7-GFX9: 1915 roundup(lds-size / (128 * 4)) 1916 1917 24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution 1918 _INVALID_OPERATION with specified exceptions 1919 enabled. 1920 1921 Used by CP to set up 1922 ``COMPUTE_PGM_RSRC2.EXCP_EN`` 1923 (set from bits 0..6). 1924 1925 IEEE 754 FP Invalid 1926 Operation 1927 25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more 1928 _SOURCE input operands is a 1929 denormal number 1930 26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by 1931 _DIVISION_BY_ZERO Zero 1932 27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP FP Overflow 1933 _OVERFLOW 1934 28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow 1935 _UNDERFLOW 1936 29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact 1937 _INEXACT 1938 30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero 1939 _ZERO (rcp_iflag_f32 instruction 1940 only) 1941 31 1 bit Reserved, must be 0. 1942 32 **Total size 4 bytes.** 1943 ======= =================================================================================================================== 1944 1945.. 1946 1947 .. table:: Floating Point Rounding Mode Enumeration Values 1948 :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table 1949 1950 ====================================== ===== ============================== 1951 Enumeration Name Value Description 1952 ====================================== ===== ============================== 1953 AMDGPU_FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even 1954 AMDGPU_FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity 1955 AMDGPU_FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity 1956 AMDGPU_FLOAT_ROUND_MODE_ZERO 3 Round Toward 0 1957 ====================================== ===== ============================== 1958 1959.. 1960 1961 .. table:: Floating Point Denorm Mode Enumeration Values 1962 :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table 1963 1964 ====================================== ===== ============================== 1965 Enumeration Name Value Description 1966 ====================================== ===== ============================== 1967 AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination 1968 Denorms 1969 AMDGPU_FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms 1970 AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms 1971 AMDGPU_FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush 1972 ====================================== ===== ============================== 1973 1974.. 1975 1976 .. table:: System VGPR Work-Item ID Enumeration Values 1977 :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table 1978 1979 ======================================== ===== ============================ 1980 Enumeration Name Value Description 1981 ======================================== ===== ============================ 1982 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension 1983 ID. 1984 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y 1985 dimensions ID. 1986 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z 1987 dimensions ID. 1988 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined. 1989 ======================================== ===== ============================ 1990 1991.. _amdgpu-amdhsa-initial-kernel-execution-state: 1992 1993Initial Kernel Execution State 1994~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1995 1996This section defines the register state that will be set up by the packet 1997processor prior to the start of execution of every wavefront. This is limited by 1998the constraints of the hardware controllers of CP/ADC/SPI. 1999 2000The order of the SGPR registers is defined, but the compiler can specify which 2001ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit 2002fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used 2003for enabled registers are dense starting at SGPR0: the first enabled register is 2004SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have 2005an SGPR number. 2006 2007The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to 2008all waves of the grid. It is possible to specify more than 16 User SGPRs using 2009the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually 2010initialized. These are then immediately followed by the System SGPRs that are 2011set up by ADC/SPI and can have different values for each wave of the grid 2012dispatch. 2013 2014SGPR register initial state is defined in 2015:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`. 2016 2017 .. table:: SGPR Register Set Up Order 2018 :name: amdgpu-amdhsa-sgpr-register-set-up-order-table 2019 2020 ========== ========================== ====== ============================== 2021 SGPR Order Name Number Description 2022 (kernel descriptor enable of 2023 field) SGPRs 2024 ========== ========================== ====== ============================== 2025 First Private Segment Buffer 4 V# that can be used, together 2026 (enable_sgpr_private with Scratch Wave Offset as an 2027 _segment_buffer) offset, to access the private 2028 memory space using a segment 2029 address. 2030 2031 CP uses the value provided by 2032 the runtime. 2033 then Dispatch Ptr 2 64 bit address of AQL dispatch 2034 (enable_sgpr_dispatch_ptr) packet for kernel dispatch 2035 actually executing. 2036 then Queue Ptr 2 64 bit address of amd_queue_t 2037 (enable_sgpr_queue_ptr) object for AQL queue on which 2038 the dispatch packet was 2039 queued. 2040 then Kernarg Segment Ptr 2 64 bit address of Kernarg 2041 (enable_sgpr_kernarg segment. This is directly 2042 _segment_ptr) copied from the 2043 kernarg_address in the kernel 2044 dispatch packet. 2045 2046 Having CP load it once avoids 2047 loading it at the beginning of 2048 every wavefront. 2049 then Dispatch Id 2 64 bit Dispatch ID of the 2050 (enable_sgpr_dispatch_id) dispatch packet being 2051 executed. 2052 then Flat Scratch Init 2 This is 2 SGPRs: 2053 (enable_sgpr_flat_scratch 2054 _init) GFX6 2055 Not supported. 2056 GFX7-GFX8 2057 The first SGPR is a 32 bit 2058 byte offset from 2059 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` 2060 to per SPI base of memory 2061 for scratch for the queue 2062 executing the kernel 2063 dispatch. CP obtains this 2064 from the runtime. (The 2065 Scratch Segment Buffer base 2066 address is 2067 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` 2068 plus this offset.) The value 2069 of Scratch Wave Offset must 2070 be added to this offset by 2071 the kernel machine code, 2072 right shifted by 8, and 2073 moved to the FLAT_SCRATCH_HI 2074 SGPR register. 2075 FLAT_SCRATCH_HI corresponds 2076 to SGPRn-4 on GFX7, and 2077 SGPRn-6 on GFX8 (where SGPRn 2078 is the highest numbered SGPR 2079 allocated to the wave). 2080 FLAT_SCRATCH_HI is 2081 multiplied by 256 (as it is 2082 in units of 256 bytes) and 2083 added to 2084 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` 2085 to calculate the per wave 2086 FLAT SCRATCH BASE in flat 2087 memory instructions that 2088 access the scratch 2089 apperture. 2090 2091 The second SGPR is 32 bit 2092 byte size of a single 2093 work-item's scratch memory 2094 usage. CP obtains this from 2095 the runtime, and it is 2096 always a multiple of DWORD. 2097 CP checks that the value in 2098 the kernel dispatch packet 2099 Private Segment Byte Size is 2100 not larger, and requests the 2101 runtime to increase the 2102 queue's scratch size if 2103 necessary. The kernel code 2104 must move it to 2105 FLAT_SCRATCH_LO which is 2106 SGPRn-3 on GFX7 and SGPRn-5 2107 on GFX8. FLAT_SCRATCH_LO is 2108 used as the FLAT SCRATCH 2109 SIZE in flat memory 2110 instructions. Having CP load 2111 it once avoids loading it at 2112 the beginning of every 2113 wavefront. 2114 GFX9 2115 This is the 2116 64 bit base address of the 2117 per SPI scratch backing 2118 memory managed by SPI for 2119 the queue executing the 2120 kernel dispatch. CP obtains 2121 this from the runtime (and 2122 divides it if there are 2123 multiple Shader Arrays each 2124 with its own SPI). The value 2125 of Scratch Wave Offset must 2126 be added by the kernel 2127 machine code and the result 2128 moved to the FLAT_SCRATCH 2129 SGPR which is SGPRn-6 and 2130 SGPRn-5. It is used as the 2131 FLAT SCRATCH BASE in flat 2132 memory instructions. 2133 then Private Segment Size 1 The 32 bit byte size of a 2134 (enable_sgpr_private single 2135 work-item's 2136 scratch_segment_size) memory 2137 allocation. This is the 2138 value from the kernel 2139 dispatch packet Private 2140 Segment Byte Size rounded up 2141 by CP to a multiple of 2142 DWORD. 2143 2144 Having CP load it once avoids 2145 loading it at the beginning of 2146 every wavefront. 2147 2148 This is not used for 2149 GFX7-GFX8 since it is the same 2150 value as the second SGPR of 2151 Flat Scratch Init. However, it 2152 may be needed for GFX9 which 2153 changes the meaning of the 2154 Flat Scratch Init value. 2155 then Grid Work-Group Count X 1 32 bit count of the number of 2156 (enable_sgpr_grid work-groups in the X dimension 2157 _workgroup_count_X) for the grid being 2158 executed. Computed from the 2159 fields in the kernel dispatch 2160 packet as ((grid_size.x + 2161 workgroup_size.x - 1) / 2162 workgroup_size.x). 2163 then Grid Work-Group Count Y 1 32 bit count of the number of 2164 (enable_sgpr_grid work-groups in the Y dimension 2165 _workgroup_count_Y && for the grid being 2166 less than 16 previous executed. Computed from the 2167 SGPRs) fields in the kernel dispatch 2168 packet as ((grid_size.y + 2169 workgroup_size.y - 1) / 2170 workgroupSize.y). 2171 2172 Only initialized if <16 2173 previous SGPRs initialized. 2174 then Grid Work-Group Count Z 1 32 bit count of the number of 2175 (enable_sgpr_grid work-groups in the Z dimension 2176 _workgroup_count_Z && for the grid being 2177 less than 16 previous executed. Computed from the 2178 SGPRs) fields in the kernel dispatch 2179 packet as ((grid_size.z + 2180 workgroup_size.z - 1) / 2181 workgroupSize.z). 2182 2183 Only initialized if <16 2184 previous SGPRs initialized. 2185 then Work-Group Id X 1 32 bit work-group id in X 2186 (enable_sgpr_workgroup_id dimension of grid for 2187 _X) wavefront. 2188 then Work-Group Id Y 1 32 bit work-group id in Y 2189 (enable_sgpr_workgroup_id dimension of grid for 2190 _Y) wavefront. 2191 then Work-Group Id Z 1 32 bit work-group id in Z 2192 (enable_sgpr_workgroup_id dimension of grid for 2193 _Z) wavefront. 2194 then Work-Group Info 1 {first_wave, 14'b0000, 2195 (enable_sgpr_workgroup ordered_append_term[10:0], 2196 _info) threadgroup_size_in_waves[5:0]} 2197 then Scratch Wave Offset 1 32 bit byte offset from base 2198 (enable_sgpr_private of scratch base of queue 2199 _segment_wave_offset) executing the kernel 2200 dispatch. Must be used as an 2201 offset with Private 2202 segment address when using 2203 Scratch Segment Buffer. It 2204 must be used to set up FLAT 2205 SCRATCH for flat addressing 2206 (see 2207 :ref:`amdgpu-amdhsa-flat-scratch`). 2208 ========== ========================== ====== ============================== 2209 2210The order of the VGPR registers is defined, but the compiler can specify which 2211ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit 2212fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used 2213for enabled registers are dense starting at VGPR0: the first enabled register is 2214VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a 2215VGPR number. 2216 2217VGPR register initial state is defined in 2218:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`. 2219 2220 .. table:: VGPR Register Set Up Order 2221 :name: amdgpu-amdhsa-vgpr-register-set-up-order-table 2222 2223 ========== ========================== ====== ============================== 2224 VGPR Order Name Number Description 2225 (kernel descriptor enable of 2226 field) VGPRs 2227 ========== ========================== ====== ============================== 2228 First Work-Item Id X 1 32 bit work item id in X 2229 (Always initialized) dimension of work-group for 2230 wavefront lane. 2231 then Work-Item Id Y 1 32 bit work item id in Y 2232 (enable_vgpr_workitem_id dimension of work-group for 2233 > 0) wavefront lane. 2234 then Work-Item Id Z 1 32 bit work item id in Z 2235 (enable_vgpr_workitem_id dimension of work-group for 2236 > 1) wavefront lane. 2237 ========== ========================== ====== ============================== 2238 2239The setting of registers is is done by GPU CP/ADC/SPI hardware as follows: 2240 22411. SGPRs before the Work-Group Ids are set by CP using the 16 User Data 2242 registers. 22432. Work-group Id registers X, Y, Z are set by ADC which supports any 2244 combination including none. 22453. Scratch Wave Offset is set by SPI in a per wave basis which is why its value 2246 cannot included with the flat scratch init value which is per queue. 22474. The VGPRs are set by SPI which only supports specifying either (X), (X, Y) 2248 or (X, Y, Z). 2249 2250Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit 2251value to the hardware required SGPRn-3 and SGPRn-4 respectively. 2252 2253The global segment can be accessed either using buffer instructions (GFX6 which 2254has V# 64 bit address support), flat instructions (GFX7-GFX9), or global 2255instructions (GFX9). 2256 2257If buffer operations are used then the compiler can generate a V# with the 2258following properties: 2259 2260* base address of 0 2261* no swizzle 2262* ATC: 1 if IOMMU present (such as APU) 2263* ptr64: 1 2264* MTYPE set to support memory coherence that matches the runtime (such as CC for 2265 APU and NC for dGPU). 2266 2267.. _amdgpu-amdhsa-kernel-prolog: 2268 2269Kernel Prolog 2270~~~~~~~~~~~~~ 2271 2272.. _amdgpu-amdhsa-m0: 2273 2274M0 2275++ 2276 2277GFX6-GFX8 2278 The M0 register must be initialized with a value at least the total LDS size 2279 if the kernel may access LDS via DS or flat operations. Total LDS size is 2280 available in dispatch packet. For M0, it is also possible to use maximum 2281 possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for 2282 GFX7-GFX8). 2283GFX9 2284 The M0 register is not used for range checking LDS accesses and so does not 2285 need to be initialized in the prolog. 2286 2287.. _amdgpu-amdhsa-flat-scratch: 2288 2289Flat Scratch 2290++++++++++++ 2291 2292If the kernel may use flat operations to access scratch memory, the prolog code 2293must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which 2294are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wave 2295Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`): 2296 2297GFX6 2298 Flat scratch is not supported. 2299 2300GFX7-GFX8 2301 1. The low word of Flat Scratch Init is 32 bit byte offset from 2302 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory 2303 being managed by SPI for the queue executing the kernel dispatch. This is 2304 the same value used in the Scratch Segment Buffer V# base address. The 2305 prolog must add the value of Scratch Wave Offset to get the wave's byte 2306 scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since 2307 FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted 2308 by 8 before moving into FLAT_SCRATCH_LO. 2309 2. The second word of Flat Scratch Init is 32 bit byte size of a single 2310 work-items scratch memory usage. This is directly loaded from the kernel 2311 dispatch packet Private Segment Byte Size and rounded up to a multiple of 2312 DWORD. Having CP load it once avoids loading it at the beginning of every 2313 wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH 2314 SIZE. 2315 2316GFX9 2317 The Flat Scratch Init is the 64 bit address of the base of scratch backing 2318 memory being managed by SPI for the queue executing the kernel dispatch. The 2319 prolog must add the value of Scratch Wave Offset and moved to the FLAT_SCRATCH 2320 pair for use as the flat scratch base in flat memory instructions. 2321 2322.. _amdgpu-amdhsa-memory-model: 2323 2324Memory Model 2325~~~~~~~~~~~~ 2326 2327This section describes the mapping of LLVM memory model onto AMDGPU machine code 2328(see :ref:`memmodel`). *The implementation is WIP.* 2329 2330.. TODO 2331 Update when implementation complete. 2332 2333The AMDGPU backend supports the memory synchronization scopes specified in 2334:ref:`amdgpu-memory-scopes`. 2335 2336The code sequences used to implement the memory model are defined in table 2337:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`. 2338 2339The sequences specify the order of instructions that a single thread must 2340execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect 2341to other memory instructions executed by the same thread. This allows them to be 2342moved earlier or later which can allow them to be combined with other instances 2343of the same instruction, or hoisted/sunk out of loops to improve 2344performance. Only the instructions related to the memory model are given; 2345additional ``s_waitcnt`` instructions are required to ensure registers are 2346defined before being used. These may be able to be combined with the memory 2347model ``s_waitcnt`` instructions as described above. 2348 2349The AMDGPU backend supports the following memory models: 2350 2351 HSA Memory Model [HSA]_ 2352 The HSA memory model uses a single happens-before relation for all address 2353 spaces (see :ref:`amdgpu-address-spaces`). 2354 OpenCL Memory Model [OpenCL]_ 2355 The OpenCL memory model which has separate happens-before relations for the 2356 global and local address spaces. Only a fence specifying both global and 2357 local address space, and seq_cst instructions join the relationships. Since 2358 the LLVM ``memfence`` instruction does not allow an address space to be 2359 specified the OpenCL fence has to convervatively assume both local and 2360 global address space was specified. However, optimizations can often be 2361 done to eliminate the additional ``s_waitcnt`` instructions when there are 2362 no intervening memory instructions which access the corresponding address 2363 space. The code sequences in the table indicate what can be omitted for the 2364 OpenCL memory. The target triple environment is used to determine if the 2365 source language is OpenCL (see :ref:`amdgpu-opencl`). 2366 2367``ds/flat_load/store/atomic`` instructions to local memory are termed LDS 2368operations. 2369 2370``buffer/global/flat_load/store/atomic`` instructions to global memory are 2371termed vector memory operations. 2372 2373For GFX6-GFX9: 2374 2375* Each agent has multiple compute units (CU). 2376* Each CU has multiple SIMDs that execute wavefronts. 2377* The wavefronts for a single work-group are executed in the same CU but may be 2378 executed by different SIMDs. 2379* Each CU has a single LDS memory shared by the wavefronts of the work-groups 2380 executing on it. 2381* All LDS operations of a CU are performed as wavefront wide operations in a 2382 global order and involve no caching. Completion is reported to a wavefront in 2383 execution order. 2384* The LDS memory has multiple request queues shared by the SIMDs of a 2385 CU. Therefore, the LDS operations performed by different waves of a work-group 2386 can be reordered relative to each other, which can result in reordering the 2387 visibility of vector memory operations with respect to LDS operations of other 2388 wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to 2389 ensure synchronization between LDS operations and vector memory operations 2390 between waves of a work-group, but not between operations performed by the 2391 same wavefront. 2392* The vector memory operations are performed as wavefront wide operations and 2393 completion is reported to a wavefront in execution order. The exception is 2394 that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of 2395 vector memory order if they access LDS memory, and out of LDS operation order 2396 if they access global memory. 2397* The vector memory operations access a single vector L1 cache shared by all 2398 SIMDs a CU. Therefore, no special action is required for coherence between the 2399 lanes of a single wavefront, or for coherence between wavefronts in the same 2400 work-group. A ``buffer_wbinvl1_vol`` is required for coherence between waves 2401 executing in different work-groups as they may be executing on different CUs. 2402* The scalar memory operations access a scalar L1 cache shared by all wavefronts 2403 on a group of CUs. The scalar and vector L1 caches are not coherent. However, 2404 scalar operations are used in a restricted way so do not impact the memory 2405 model. See :ref:`amdgpu-amdhsa-memory-spaces`. 2406* The vector and scalar memory operations use an L2 cache shared by all CUs on 2407 the same agent. 2408* The L2 cache has independent channels to service disjoint ranges of virtual 2409 addresses. 2410* Each CU has a separate request queue per channel. Therefore, the vector and 2411 scalar memory operations performed by waves executing in different work-groups 2412 (which may be executing on different CUs) of an agent can be reordered 2413 relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure 2414 synchronization between vector memory operations of different CUs. It ensures a 2415 previous vector memory operation has completed before executing a subsequent 2416 vector memory or LDS operation and so can be used to meet the requirements of 2417 acquire and release. 2418* The L2 cache can be kept coherent with other agents on some targets, or ranges 2419 of virtual addresses can be set up to bypass it to ensure system coherence. 2420 2421Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-GFX8), 2422or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the 2423memory, atomic memory orderings are not meaningful and all accesses are treated 2424as non-atomic. 2425 2426Constant address space uses ``buffer/global_load`` instructions (or equivalent 2427scalar memory instructions). Since the constant address space contents do not 2428change during the execution of a kernel dispatch it is not legal to perform 2429stores, and atomic memory orderings are not meaningful and all access are 2430treated as non-atomic. 2431 2432A memory synchronization scope wider than work-group is not meaningful for the 2433group (LDS) address space and is treated as work-group. 2434 2435The memory model does not support the region address space which is treated as 2436non-atomic. 2437 2438Acquire memory ordering is not meaningful on store atomic instructions and is 2439treated as non-atomic. 2440 2441Release memory ordering is not meaningful on load atomic instructions and is 2442treated a non-atomic. 2443 2444Acquire-release memory ordering is not meaningful on load or store atomic 2445instructions and is treated as acquire and release respectively. 2446 2447AMDGPU backend only uses scalar memory operations to access memory that is 2448proven to not change during the execution of the kernel dispatch. This includes 2449constant address space and global address space for program scope const 2450variables. Therefore the kernel machine code does not have to maintain the 2451scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar 2452and vector L1 caches are invalidated between kernel dispatches by CP since 2453constant address space data may change between kernel dispatch executions. See 2454:ref:`amdgpu-amdhsa-memory-spaces`. 2455 2456The one execption is if scalar writes are used to spill SGPR registers. In this 2457case the AMDGPU backend ensures the memory location used to spill is never 2458accessed by vector memory operations at the same time. If scalar writes are used 2459then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function 2460return since the locations may be used for vector memory instructions by a 2461future wave that uses the same scratch area, or a function call that creates a 2462frame at the same address, respectively. There is no need for a ``s_dcache_inv`` 2463as all scalar writes are write-before-read in the same thread. 2464 2465Scratch backing memory (which is used for the private address space) 2466is accessed with MTYPE NC_NV (non-coherenent non-volatile). Since the private 2467address space is only accessed by a single thread, and is always 2468write-before-read, there is never a need to invalidate these entries from the L1 2469cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the 2470volatile cache lines. 2471 2472On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing 2473to invalidate the L2 cache. This also causes it to be treated as 2474non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC 2475(cache coherent) and so the L2 cache will coherent with the CPU and other 2476agents. 2477 2478 .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9 2479 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table 2480 2481 ============ ============ ============== ========== =============================== 2482 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code 2483 Ordering Sync Scope Address 2484 Space 2485 ============ ============ ============== ========== =============================== 2486 **Non-Atomic** 2487 ----------------------------------------------------------------------------------- 2488 load *none* *none* - global - !volatile & !nontemporal 2489 - generic 2490 - private 1. buffer/global/flat_load 2491 - constant 2492 - volatile & !nontemporal 2493 2494 1. buffer/global/flat_load 2495 glc=1 2496 2497 - nontemporal 2498 2499 1. buffer/global/flat_load 2500 glc=1 slc=1 2501 2502 load *none* *none* - local 1. ds_load 2503 store *none* *none* - global - !nontemporal 2504 - generic 2505 - private 1. buffer/global/flat_store 2506 - constant 2507 - nontemporal 2508 2509 1. buffer/global/flat_stote 2510 glc=1 slc=1 2511 2512 store *none* *none* - local 1. ds_store 2513 **Unordered Atomic** 2514 ----------------------------------------------------------------------------------- 2515 load atomic unordered *any* *any* *Same as non-atomic*. 2516 store atomic unordered *any* *any* *Same as non-atomic*. 2517 atomicrmw unordered *any* *any* *Same as monotonic 2518 atomic*. 2519 **Monotonic Atomic** 2520 ----------------------------------------------------------------------------------- 2521 load atomic monotonic - singlethread - global 1. buffer/global/flat_load 2522 - wavefront - generic 2523 - workgroup 2524 load atomic monotonic - singlethread - local 1. ds_load 2525 - wavefront 2526 - workgroup 2527 load atomic monotonic - agent - global 1. buffer/global/flat_load 2528 - system - generic glc=1 2529 store atomic monotonic - singlethread - global 1. buffer/global/flat_store 2530 - wavefront - generic 2531 - workgroup 2532 - agent 2533 - system 2534 store atomic monotonic - singlethread - local 1. ds_store 2535 - wavefront 2536 - workgroup 2537 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic 2538 - wavefront - generic 2539 - workgroup 2540 - agent 2541 - system 2542 atomicrmw monotonic - singlethread - local 1. ds_atomic 2543 - wavefront 2544 - workgroup 2545 **Acquire Atomic** 2546 ----------------------------------------------------------------------------------- 2547 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load 2548 - wavefront - local 2549 - generic 2550 load atomic acquire - workgroup - global 1. buffer/global/flat_load 2551 load atomic acquire - workgroup - local 1. ds_load 2552 2. s_waitcnt lgkmcnt(0) 2553 2554 - If OpenCL, omit. 2555 - Must happen before 2556 any following 2557 global/generic 2558 load/load 2559 atomic/store/store 2560 atomic/atomicrmw. 2561 - Ensures any 2562 following global 2563 data read is no 2564 older than the load 2565 atomic value being 2566 acquired. 2567 load atomic acquire - workgroup - generic 1. flat_load 2568 2. s_waitcnt lgkmcnt(0) 2569 2570 - If OpenCL, omit. 2571 - Must happen before 2572 any following 2573 global/generic 2574 load/load 2575 atomic/store/store 2576 atomic/atomicrmw. 2577 - Ensures any 2578 following global 2579 data read is no 2580 older than the load 2581 atomic value being 2582 acquired. 2583 load atomic acquire - agent - global 1. buffer/global/flat_load 2584 - system glc=1 2585 2. s_waitcnt vmcnt(0) 2586 2587 - Must happen before 2588 following 2589 buffer_wbinvl1_vol. 2590 - Ensures the load 2591 has completed 2592 before invalidating 2593 the cache. 2594 2595 3. buffer_wbinvl1_vol 2596 2597 - Must happen before 2598 any following 2599 global/generic 2600 load/load 2601 atomic/atomicrmw. 2602 - Ensures that 2603 following 2604 loads will not see 2605 stale global data. 2606 2607 load atomic acquire - agent - generic 1. flat_load glc=1 2608 - system 2. s_waitcnt vmcnt(0) & 2609 lgkmcnt(0) 2610 2611 - If OpenCL omit 2612 lgkmcnt(0). 2613 - Must happen before 2614 following 2615 buffer_wbinvl1_vol. 2616 - Ensures the flat_load 2617 has completed 2618 before invalidating 2619 the cache. 2620 2621 3. buffer_wbinvl1_vol 2622 2623 - Must happen before 2624 any following 2625 global/generic 2626 load/load 2627 atomic/atomicrmw. 2628 - Ensures that 2629 following loads 2630 will not see stale 2631 global data. 2632 2633 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic 2634 - wavefront - local 2635 - generic 2636 atomicrmw acquire - workgroup - global 1. buffer/global/flat_atomic 2637 atomicrmw acquire - workgroup - local 1. ds_atomic 2638 2. waitcnt lgkmcnt(0) 2639 2640 - If OpenCL, omit. 2641 - Must happen before 2642 any following 2643 global/generic 2644 load/load 2645 atomic/store/store 2646 atomic/atomicrmw. 2647 - Ensures any 2648 following global 2649 data read is no 2650 older than the 2651 atomicrmw value 2652 being acquired. 2653 2654 atomicrmw acquire - workgroup - generic 1. flat_atomic 2655 2. waitcnt lgkmcnt(0) 2656 2657 - If OpenCL, omit. 2658 - Must happen before 2659 any following 2660 global/generic 2661 load/load 2662 atomic/store/store 2663 atomic/atomicrmw. 2664 - Ensures any 2665 following global 2666 data read is no 2667 older than the 2668 atomicrmw value 2669 being acquired. 2670 2671 atomicrmw acquire - agent - global 1. buffer/global/flat_atomic 2672 - system 2. s_waitcnt vmcnt(0) 2673 2674 - Must happen before 2675 following 2676 buffer_wbinvl1_vol. 2677 - Ensures the 2678 atomicrmw has 2679 completed before 2680 invalidating the 2681 cache. 2682 2683 3. buffer_wbinvl1_vol 2684 2685 - Must happen before 2686 any following 2687 global/generic 2688 load/load 2689 atomic/atomicrmw. 2690 - Ensures that 2691 following loads 2692 will not see stale 2693 global data. 2694 2695 atomicrmw acquire - agent - generic 1. flat_atomic 2696 - system 2. s_waitcnt vmcnt(0) & 2697 lgkmcnt(0) 2698 2699 - If OpenCL, omit 2700 lgkmcnt(0). 2701 - Must happen before 2702 following 2703 buffer_wbinvl1_vol. 2704 - Ensures the 2705 atomicrmw has 2706 completed before 2707 invalidating the 2708 cache. 2709 2710 3. buffer_wbinvl1_vol 2711 2712 - Must happen before 2713 any following 2714 global/generic 2715 load/load 2716 atomic/atomicrmw. 2717 - Ensures that 2718 following loads 2719 will not see stale 2720 global data. 2721 2722 fence acquire - singlethread *none* *none* 2723 - wavefront 2724 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0) 2725 2726 - If OpenCL and 2727 address space is 2728 not generic, omit. 2729 - However, since LLVM 2730 currently has no 2731 address space on 2732 the fence need to 2733 conservatively 2734 always generate. If 2735 fence had an 2736 address space then 2737 set to address 2738 space of OpenCL 2739 fence flag, or to 2740 generic if both 2741 local and global 2742 flags are 2743 specified. 2744 - Must happen after 2745 any preceding 2746 local/generic load 2747 atomic/atomicrmw 2748 with an equal or 2749 wider sync scope 2750 and memory ordering 2751 stronger than 2752 unordered (this is 2753 termed the 2754 fence-paired-atomic). 2755 - Must happen before 2756 any following 2757 global/generic 2758 load/load 2759 atomic/store/store 2760 atomic/atomicrmw. 2761 - Ensures any 2762 following global 2763 data read is no 2764 older than the 2765 value read by the 2766 fence-paired-atomic. 2767 2768 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) & 2769 - system vmcnt(0) 2770 2771 - If OpenCL and 2772 address space is 2773 not generic, omit 2774 lgkmcnt(0). 2775 - However, since LLVM 2776 currently has no 2777 address space on 2778 the fence need to 2779 conservatively 2780 always generate 2781 (see comment for 2782 previous fence). 2783 - Could be split into 2784 separate s_waitcnt 2785 vmcnt(0) and 2786 s_waitcnt 2787 lgkmcnt(0) to allow 2788 them to be 2789 independently moved 2790 according to the 2791 following rules. 2792 - s_waitcnt vmcnt(0) 2793 must happen after 2794 any preceding 2795 global/generic load 2796 atomic/atomicrmw 2797 with an equal or 2798 wider sync scope 2799 and memory ordering 2800 stronger than 2801 unordered (this is 2802 termed the 2803 fence-paired-atomic). 2804 - s_waitcnt lgkmcnt(0) 2805 must happen after 2806 any preceding 2807 local/generic load 2808 atomic/atomicrmw 2809 with an equal or 2810 wider sync scope 2811 and memory ordering 2812 stronger than 2813 unordered (this is 2814 termed the 2815 fence-paired-atomic). 2816 - Must happen before 2817 the following 2818 buffer_wbinvl1_vol. 2819 - Ensures that the 2820 fence-paired atomic 2821 has completed 2822 before invalidating 2823 the 2824 cache. Therefore 2825 any following 2826 locations read must 2827 be no older than 2828 the value read by 2829 the 2830 fence-paired-atomic. 2831 2832 2. buffer_wbinvl1_vol 2833 2834 - Must happen before any 2835 following global/generic 2836 load/load 2837 atomic/store/store 2838 atomic/atomicrmw. 2839 - Ensures that 2840 following loads 2841 will not see stale 2842 global data. 2843 2844 **Release Atomic** 2845 ----------------------------------------------------------------------------------- 2846 store atomic release - singlethread - global 1. buffer/global/ds/flat_store 2847 - wavefront - local 2848 - generic 2849 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0) 2850 2851 - If OpenCL, omit. 2852 - Must happen after 2853 any preceding 2854 local/generic 2855 load/store/load 2856 atomic/store 2857 atomic/atomicrmw. 2858 - Must happen before 2859 the following 2860 store. 2861 - Ensures that all 2862 memory operations 2863 to local have 2864 completed before 2865 performing the 2866 store that is being 2867 released. 2868 2869 2. buffer/global/flat_store 2870 store atomic release - workgroup - local 1. ds_store 2871 store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0) 2872 2873 - If OpenCL, omit. 2874 - Must happen after 2875 any preceding 2876 local/generic 2877 load/store/load 2878 atomic/store 2879 atomic/atomicrmw. 2880 - Must happen before 2881 the following 2882 store. 2883 - Ensures that all 2884 memory operations 2885 to local have 2886 completed before 2887 performing the 2888 store that is being 2889 released. 2890 2891 2. flat_store 2892 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) & 2893 - system - generic vmcnt(0) 2894 2895 - If OpenCL, omit 2896 lgkmcnt(0). 2897 - Could be split into 2898 separate s_waitcnt 2899 vmcnt(0) and 2900 s_waitcnt 2901 lgkmcnt(0) to allow 2902 them to be 2903 independently moved 2904 according to the 2905 following rules. 2906 - s_waitcnt vmcnt(0) 2907 must happen after 2908 any preceding 2909 global/generic 2910 load/store/load 2911 atomic/store 2912 atomic/atomicrmw. 2913 - s_waitcnt lgkmcnt(0) 2914 must happen after 2915 any preceding 2916 local/generic 2917 load/store/load 2918 atomic/store 2919 atomic/atomicrmw. 2920 - Must happen before 2921 the following 2922 store. 2923 - Ensures that all 2924 memory operations 2925 to memory have 2926 completed before 2927 performing the 2928 store that is being 2929 released. 2930 2931 2. buffer/global/ds/flat_store 2932 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic 2933 - wavefront - local 2934 - generic 2935 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0) 2936 2937 - If OpenCL, omit. 2938 - Must happen after 2939 any preceding 2940 local/generic 2941 load/store/load 2942 atomic/store 2943 atomic/atomicrmw. 2944 - Must happen before 2945 the following 2946 atomicrmw. 2947 - Ensures that all 2948 memory operations 2949 to local have 2950 completed before 2951 performing the 2952 atomicrmw that is 2953 being released. 2954 2955 2. buffer/global/flat_atomic 2956 atomicrmw release - workgroup - local 1. ds_atomic 2957 atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0) 2958 2959 - If OpenCL, omit. 2960 - Must happen after 2961 any preceding 2962 local/generic 2963 load/store/load 2964 atomic/store 2965 atomic/atomicrmw. 2966 - Must happen before 2967 the following 2968 atomicrmw. 2969 - Ensures that all 2970 memory operations 2971 to local have 2972 completed before 2973 performing the 2974 atomicrmw that is 2975 being released. 2976 2977 2. flat_atomic 2978 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) & 2979 - system - generic vmcnt(0) 2980 2981 - If OpenCL, omit 2982 lgkmcnt(0). 2983 - Could be split into 2984 separate s_waitcnt 2985 vmcnt(0) and 2986 s_waitcnt 2987 lgkmcnt(0) to allow 2988 them to be 2989 independently moved 2990 according to the 2991 following rules. 2992 - s_waitcnt vmcnt(0) 2993 must happen after 2994 any preceding 2995 global/generic 2996 load/store/load 2997 atomic/store 2998 atomic/atomicrmw. 2999 - s_waitcnt lgkmcnt(0) 3000 must happen after 3001 any preceding 3002 local/generic 3003 load/store/load 3004 atomic/store 3005 atomic/atomicrmw. 3006 - Must happen before 3007 the following 3008 atomicrmw. 3009 - Ensures that all 3010 memory operations 3011 to global and local 3012 have completed 3013 before performing 3014 the atomicrmw that 3015 is being released. 3016 3017 2. buffer/global/ds/flat_atomic 3018 fence release - singlethread *none* *none* 3019 - wavefront 3020 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0) 3021 3022 - If OpenCL and 3023 address space is 3024 not generic, omit. 3025 - However, since LLVM 3026 currently has no 3027 address space on 3028 the fence need to 3029 conservatively 3030 always generate. If 3031 fence had an 3032 address space then 3033 set to address 3034 space of OpenCL 3035 fence flag, or to 3036 generic if both 3037 local and global 3038 flags are 3039 specified. 3040 - Must happen after 3041 any preceding 3042 local/generic 3043 load/load 3044 atomic/store/store 3045 atomic/atomicrmw. 3046 - Must happen before 3047 any following store 3048 atomic/atomicrmw 3049 with an equal or 3050 wider sync scope 3051 and memory ordering 3052 stronger than 3053 unordered (this is 3054 termed the 3055 fence-paired-atomic). 3056 - Ensures that all 3057 memory operations 3058 to local have 3059 completed before 3060 performing the 3061 following 3062 fence-paired-atomic. 3063 3064 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) & 3065 - system vmcnt(0) 3066 3067 - If OpenCL and 3068 address space is 3069 not generic, omit 3070 lgkmcnt(0). 3071 - If OpenCL and 3072 address space is 3073 local, omit 3074 vmcnt(0). 3075 - However, since LLVM 3076 currently has no 3077 address space on 3078 the fence need to 3079 conservatively 3080 always generate. If 3081 fence had an 3082 address space then 3083 set to address 3084 space of OpenCL 3085 fence flag, or to 3086 generic if both 3087 local and global 3088 flags are 3089 specified. 3090 - Could be split into 3091 separate s_waitcnt 3092 vmcnt(0) and 3093 s_waitcnt 3094 lgkmcnt(0) to allow 3095 them to be 3096 independently moved 3097 according to the 3098 following rules. 3099 - s_waitcnt vmcnt(0) 3100 must happen after 3101 any preceding 3102 global/generic 3103 load/store/load 3104 atomic/store 3105 atomic/atomicrmw. 3106 - s_waitcnt lgkmcnt(0) 3107 must happen after 3108 any preceding 3109 local/generic 3110 load/store/load 3111 atomic/store 3112 atomic/atomicrmw. 3113 - Must happen before 3114 any following store 3115 atomic/atomicrmw 3116 with an equal or 3117 wider sync scope 3118 and memory ordering 3119 stronger than 3120 unordered (this is 3121 termed the 3122 fence-paired-atomic). 3123 - Ensures that all 3124 memory operations 3125 have 3126 completed before 3127 performing the 3128 following 3129 fence-paired-atomic. 3130 3131 **Acquire-Release Atomic** 3132 ----------------------------------------------------------------------------------- 3133 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic 3134 - wavefront - local 3135 - generic 3136 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0) 3137 3138 - If OpenCL, omit. 3139 - Must happen after 3140 any preceding 3141 local/generic 3142 load/store/load 3143 atomic/store 3144 atomic/atomicrmw. 3145 - Must happen before 3146 the following 3147 atomicrmw. 3148 - Ensures that all 3149 memory operations 3150 to local have 3151 completed before 3152 performing the 3153 atomicrmw that is 3154 being released. 3155 3156 2. buffer/global/flat_atomic 3157 atomicrmw acq_rel - workgroup - local 1. ds_atomic 3158 2. s_waitcnt lgkmcnt(0) 3159 3160 - If OpenCL, omit. 3161 - Must happen before 3162 any following 3163 global/generic 3164 load/load 3165 atomic/store/store 3166 atomic/atomicrmw. 3167 - Ensures any 3168 following global 3169 data read is no 3170 older than the load 3171 atomic value being 3172 acquired. 3173 3174 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0) 3175 3176 - If OpenCL, omit. 3177 - Must happen after 3178 any preceding 3179 local/generic 3180 load/store/load 3181 atomic/store 3182 atomic/atomicrmw. 3183 - Must happen before 3184 the following 3185 atomicrmw. 3186 - Ensures that all 3187 memory operations 3188 to local have 3189 completed before 3190 performing the 3191 atomicrmw that is 3192 being released. 3193 3194 2. flat_atomic 3195 3. s_waitcnt lgkmcnt(0) 3196 3197 - If OpenCL, omit. 3198 - Must happen before 3199 any following 3200 global/generic 3201 load/load 3202 atomic/store/store 3203 atomic/atomicrmw. 3204 - Ensures any 3205 following global 3206 data read is no 3207 older than the load 3208 atomic value being 3209 acquired. 3210 3211 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) & 3212 - system vmcnt(0) 3213 3214 - If OpenCL, omit 3215 lgkmcnt(0). 3216 - Could be split into 3217 separate s_waitcnt 3218 vmcnt(0) and 3219 s_waitcnt 3220 lgkmcnt(0) to allow 3221 them to be 3222 independently moved 3223 according to the 3224 following rules. 3225 - s_waitcnt vmcnt(0) 3226 must happen after 3227 any preceding 3228 global/generic 3229 load/store/load 3230 atomic/store 3231 atomic/atomicrmw. 3232 - s_waitcnt lgkmcnt(0) 3233 must happen after 3234 any preceding 3235 local/generic 3236 load/store/load 3237 atomic/store 3238 atomic/atomicrmw. 3239 - Must happen before 3240 the following 3241 atomicrmw. 3242 - Ensures that all 3243 memory operations 3244 to global have 3245 completed before 3246 performing the 3247 atomicrmw that is 3248 being released. 3249 3250 2. buffer/global/flat_atomic 3251 3. s_waitcnt vmcnt(0) 3252 3253 - Must happen before 3254 following 3255 buffer_wbinvl1_vol. 3256 - Ensures the 3257 atomicrmw has 3258 completed before 3259 invalidating the 3260 cache. 3261 3262 4. buffer_wbinvl1_vol 3263 3264 - Must happen before 3265 any following 3266 global/generic 3267 load/load 3268 atomic/atomicrmw. 3269 - Ensures that 3270 following loads 3271 will not see stale 3272 global data. 3273 3274 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) & 3275 - system vmcnt(0) 3276 3277 - If OpenCL, omit 3278 lgkmcnt(0). 3279 - Could be split into 3280 separate s_waitcnt 3281 vmcnt(0) and 3282 s_waitcnt 3283 lgkmcnt(0) to allow 3284 them to be 3285 independently moved 3286 according to the 3287 following rules. 3288 - s_waitcnt vmcnt(0) 3289 must happen after 3290 any preceding 3291 global/generic 3292 load/store/load 3293 atomic/store 3294 atomic/atomicrmw. 3295 - s_waitcnt lgkmcnt(0) 3296 must happen after 3297 any preceding 3298 local/generic 3299 load/store/load 3300 atomic/store 3301 atomic/atomicrmw. 3302 - Must happen before 3303 the following 3304 atomicrmw. 3305 - Ensures that all 3306 memory operations 3307 to global have 3308 completed before 3309 performing the 3310 atomicrmw that is 3311 being released. 3312 3313 2. flat_atomic 3314 3. s_waitcnt vmcnt(0) & 3315 lgkmcnt(0) 3316 3317 - If OpenCL, omit 3318 lgkmcnt(0). 3319 - Must happen before 3320 following 3321 buffer_wbinvl1_vol. 3322 - Ensures the 3323 atomicrmw has 3324 completed before 3325 invalidating the 3326 cache. 3327 3328 4. buffer_wbinvl1_vol 3329 3330 - Must happen before 3331 any following 3332 global/generic 3333 load/load 3334 atomic/atomicrmw. 3335 - Ensures that 3336 following loads 3337 will not see stale 3338 global data. 3339 3340 fence acq_rel - singlethread *none* *none* 3341 - wavefront 3342 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0) 3343 3344 - If OpenCL and 3345 address space is 3346 not generic, omit. 3347 - However, 3348 since LLVM 3349 currently has no 3350 address space on 3351 the fence need to 3352 conservatively 3353 always generate 3354 (see comment for 3355 previous fence). 3356 - Must happen after 3357 any preceding 3358 local/generic 3359 load/load 3360 atomic/store/store 3361 atomic/atomicrmw. 3362 - Must happen before 3363 any following 3364 global/generic 3365 load/load 3366 atomic/store/store 3367 atomic/atomicrmw. 3368 - Ensures that all 3369 memory operations 3370 to local have 3371 completed before 3372 performing any 3373 following global 3374 memory operations. 3375 - Ensures that the 3376 preceding 3377 local/generic load 3378 atomic/atomicrmw 3379 with an equal or 3380 wider sync scope 3381 and memory ordering 3382 stronger than 3383 unordered (this is 3384 termed the 3385 acquire-fence-paired-atomic 3386 ) has completed 3387 before following 3388 global memory 3389 operations. This 3390 satisfies the 3391 requirements of 3392 acquire. 3393 - Ensures that all 3394 previous memory 3395 operations have 3396 completed before a 3397 following 3398 local/generic store 3399 atomic/atomicrmw 3400 with an equal or 3401 wider sync scope 3402 and memory ordering 3403 stronger than 3404 unordered (this is 3405 termed the 3406 release-fence-paired-atomic 3407 ). This satisfies the 3408 requirements of 3409 release. 3410 3411 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) & 3412 - system vmcnt(0) 3413 3414 - If OpenCL and 3415 address space is 3416 not generic, omit 3417 lgkmcnt(0). 3418 - However, since LLVM 3419 currently has no 3420 address space on 3421 the fence need to 3422 conservatively 3423 always generate 3424 (see comment for 3425 previous fence). 3426 - Could be split into 3427 separate s_waitcnt 3428 vmcnt(0) and 3429 s_waitcnt 3430 lgkmcnt(0) to allow 3431 them to be 3432 independently moved 3433 according to the 3434 following rules. 3435 - s_waitcnt vmcnt(0) 3436 must happen after 3437 any preceding 3438 global/generic 3439 load/store/load 3440 atomic/store 3441 atomic/atomicrmw. 3442 - s_waitcnt lgkmcnt(0) 3443 must happen after 3444 any preceding 3445 local/generic 3446 load/store/load 3447 atomic/store 3448 atomic/atomicrmw. 3449 - Must happen before 3450 the following 3451 buffer_wbinvl1_vol. 3452 - Ensures that the 3453 preceding 3454 global/local/generic 3455 load 3456 atomic/atomicrmw 3457 with an equal or 3458 wider sync scope 3459 and memory ordering 3460 stronger than 3461 unordered (this is 3462 termed the 3463 acquire-fence-paired-atomic 3464 ) has completed 3465 before invalidating 3466 the cache. This 3467 satisfies the 3468 requirements of 3469 acquire. 3470 - Ensures that all 3471 previous memory 3472 operations have 3473 completed before a 3474 following 3475 global/local/generic 3476 store 3477 atomic/atomicrmw 3478 with an equal or 3479 wider sync scope 3480 and memory ordering 3481 stronger than 3482 unordered (this is 3483 termed the 3484 release-fence-paired-atomic 3485 ). This satisfies the 3486 requirements of 3487 release. 3488 3489 2. buffer_wbinvl1_vol 3490 3491 - Must happen before 3492 any following 3493 global/generic 3494 load/load 3495 atomic/store/store 3496 atomic/atomicrmw. 3497 - Ensures that 3498 following loads 3499 will not see stale 3500 global data. This 3501 satisfies the 3502 requirements of 3503 acquire. 3504 3505 **Sequential Consistent Atomic** 3506 ----------------------------------------------------------------------------------- 3507 load atomic seq_cst - singlethread - global *Same as corresponding 3508 - wavefront - local load atomic acquire, 3509 - generic except must generated 3510 all instructions even 3511 for OpenCL.* 3512 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0) 3513 - generic 3514 - Must 3515 happen after 3516 preceding 3517 global/generic load 3518 atomic/store 3519 atomic/atomicrmw 3520 with memory 3521 ordering of seq_cst 3522 and with equal or 3523 wider sync scope. 3524 (Note that seq_cst 3525 fences have their 3526 own s_waitcnt 3527 lgkmcnt(0) and so do 3528 not need to be 3529 considered.) 3530 - Ensures any 3531 preceding 3532 sequential 3533 consistent local 3534 memory instructions 3535 have completed 3536 before executing 3537 this sequentially 3538 consistent 3539 instruction. This 3540 prevents reordering 3541 a seq_cst store 3542 followed by a 3543 seq_cst load. (Note 3544 that seq_cst is 3545 stronger than 3546 acquire/release as 3547 the reordering of 3548 load acquire 3549 followed by a store 3550 release is 3551 prevented by the 3552 waitcnt of 3553 the release, but 3554 there is nothing 3555 preventing a store 3556 release followed by 3557 load acquire from 3558 competing out of 3559 order.) 3560 3561 2. *Following 3562 instructions same as 3563 corresponding load 3564 atomic acquire, 3565 except must generated 3566 all instructions even 3567 for OpenCL.* 3568 load atomic seq_cst - workgroup - local *Same as corresponding 3569 load atomic acquire, 3570 except must generated 3571 all instructions even 3572 for OpenCL.* 3573 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) & 3574 - system - generic vmcnt(0) 3575 3576 - Could be split into 3577 separate s_waitcnt 3578 vmcnt(0) 3579 and s_waitcnt 3580 lgkmcnt(0) to allow 3581 them to be 3582 independently moved 3583 according to the 3584 following rules. 3585 - waitcnt lgkmcnt(0) 3586 must happen after 3587 preceding 3588 global/generic load 3589 atomic/store 3590 atomic/atomicrmw 3591 with memory 3592 ordering of seq_cst 3593 and with equal or 3594 wider sync scope. 3595 (Note that seq_cst 3596 fences have their 3597 own s_waitcnt 3598 lgkmcnt(0) and so do 3599 not need to be 3600 considered.) 3601 - waitcnt vmcnt(0) 3602 must happen after 3603 preceding 3604 global/generic load 3605 atomic/store 3606 atomic/atomicrmw 3607 with memory 3608 ordering of seq_cst 3609 and with equal or 3610 wider sync scope. 3611 (Note that seq_cst 3612 fences have their 3613 own s_waitcnt 3614 vmcnt(0) and so do 3615 not need to be 3616 considered.) 3617 - Ensures any 3618 preceding 3619 sequential 3620 consistent global 3621 memory instructions 3622 have completed 3623 before executing 3624 this sequentially 3625 consistent 3626 instruction. This 3627 prevents reordering 3628 a seq_cst store 3629 followed by a 3630 seq_cst load. (Note 3631 that seq_cst is 3632 stronger than 3633 acquire/release as 3634 the reordering of 3635 load acquire 3636 followed by a store 3637 release is 3638 prevented by the 3639 waitcnt of 3640 the release, but 3641 there is nothing 3642 preventing a store 3643 release followed by 3644 load acquire from 3645 competing out of 3646 order.) 3647 3648 2. *Following 3649 instructions same as 3650 corresponding load 3651 atomic acquire, 3652 except must generated 3653 all instructions even 3654 for OpenCL.* 3655 store atomic seq_cst - singlethread - global *Same as corresponding 3656 - wavefront - local store atomic release, 3657 - workgroup - generic except must generated 3658 all instructions even 3659 for OpenCL.* 3660 store atomic seq_cst - agent - global *Same as corresponding 3661 - system - generic store atomic release, 3662 except must generated 3663 all instructions even 3664 for OpenCL.* 3665 atomicrmw seq_cst - singlethread - global *Same as corresponding 3666 - wavefront - local atomicrmw acq_rel, 3667 - workgroup - generic except must generated 3668 all instructions even 3669 for OpenCL.* 3670 atomicrmw seq_cst - agent - global *Same as corresponding 3671 - system - generic atomicrmw acq_rel, 3672 except must generated 3673 all instructions even 3674 for OpenCL.* 3675 fence seq_cst - singlethread *none* *Same as corresponding 3676 - wavefront fence acq_rel, 3677 - workgroup except must generated 3678 - agent all instructions even 3679 - system for OpenCL.* 3680 ============ ============ ============== ========== =============================== 3681 3682The memory order also adds the single thread optimization constrains defined in 3683table 3684:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`. 3685 3686 .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9 3687 :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table 3688 3689 ============ ============================================================== 3690 LLVM Memory Optimization Constraints 3691 Ordering 3692 ============ ============================================================== 3693 unordered *none* 3694 monotonic *none* 3695 acquire - If a load atomic/atomicrmw then no following load/load 3696 atomic/store/ store atomic/atomicrmw/fence instruction can 3697 be moved before the acquire. 3698 - If a fence then same as load atomic, plus no preceding 3699 associated fence-paired-atomic can be moved after the fence. 3700 release - If a store atomic/atomicrmw then no preceding load/load 3701 atomic/store/ store atomic/atomicrmw/fence instruction can 3702 be moved after the release. 3703 - If a fence then same as store atomic, plus no following 3704 associated fence-paired-atomic can be moved before the 3705 fence. 3706 acq_rel Same constraints as both acquire and release. 3707 seq_cst - If a load atomic then same constraints as acquire, plus no 3708 preceding sequentially consistent load atomic/store 3709 atomic/atomicrmw/fence instruction can be moved after the 3710 seq_cst. 3711 - If a store atomic then the same constraints as release, plus 3712 no following sequentially consistent load atomic/store 3713 atomic/atomicrmw/fence instruction can be moved before the 3714 seq_cst. 3715 - If an atomicrmw/fence then same constraints as acq_rel. 3716 ============ ============================================================== 3717 3718Trap Handler ABI 3719~~~~~~~~~~~~~~~~ 3720 3721For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes 3722(such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports 3723the ``s_trap`` instruction with the following usage: 3724 3725 .. table:: AMDGPU Trap Handler for AMDHSA OS 3726 :name: amdgpu-trap-handler-for-amdhsa-os-table 3727 3728 =================== =============== =============== ======================= 3729 Usage Code Sequence Trap Handler Description 3730 Inputs 3731 =================== =============== =============== ======================= 3732 reserved ``s_trap 0x00`` Reserved by hardware. 3733 ``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA 3734 ``queue_ptr`` ``debugtrap`` 3735 ``VGPR0``: intrinsic (not 3736 ``arg`` implemented). 3737 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be 3738 ``queue_ptr`` terminated and its 3739 associated queue put 3740 into the error state. 3741 ``llvm.debugtrap`` ``s_trap 0x03`` ``SGPR0-1``: If debugger not 3742 ``queue_ptr`` installed handled 3743 same as ``llvm.trap``. 3744 debugger breakpoint ``s_trap 0x07`` Reserved for debugger 3745 breakpoints. 3746 debugger ``s_trap 0x08`` Reserved for debugger. 3747 debugger ``s_trap 0xfe`` Reserved for debugger. 3748 debugger ``s_trap 0xff`` Reserved for debugger. 3749 =================== =============== =============== ======================= 3750 3751Unspecified OS 3752-------------- 3753 3754This section provides code conventions used when the target triple OS is 3755empty (see :ref:`amdgpu-target-triples`). 3756 3757Trap Handler ABI 3758~~~~~~~~~~~~~~~~ 3759 3760For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does 3761not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap`` 3762instructions are handled as follows: 3763 3764 .. table:: AMDGPU Trap Handler for Non-AMDHSA OS 3765 :name: amdgpu-trap-handler-for-non-amdhsa-os-table 3766 3767 =============== =============== =========================================== 3768 Usage Code Sequence Description 3769 =============== =============== =========================================== 3770 llvm.trap s_endpgm Causes wavefront to be terminated. 3771 llvm.debugtrap *none* Compiler warning given that there is no 3772 trap handler installed. 3773 =============== =============== =========================================== 3774 3775Source Languages 3776================ 3777 3778.. _amdgpu-opencl: 3779 3780OpenCL 3781------ 3782 3783When generating code for the OpenCL language the target triple environment 3784should be ``opencl`` or ``amdgizcl`` (see :ref:`amdgpu-target-triples`). 3785 3786When the language is OpenCL the following differences occur: 3787 37881. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`). 37892. The AMDGPU backend adds additional arguments to the kernel. 37903. Additional metadata is generated 3791 (:ref:`amdgpu-amdhsa-hsa-code-object-metadata`). 3792 3793.. TODO 3794 Specify what affect this has. Hidden arguments added. Additional metadata 3795 generated. 3796 3797.. _amdgpu-hcc: 3798 3799HCC 3800--- 3801 3802When generating code for the OpenCL language the target triple environment 3803should be ``hcc`` (see :ref:`amdgpu-target-triples`). 3804 3805When the language is OpenCL the following differences occur: 3806 38071. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`). 3808 3809.. TODO 3810 Specify what affect this has. 3811 3812Assembler 3813--------- 3814 3815AMDGPU backend has LLVM-MC based assembler which is currently in development. 3816It supports AMDGCN GFX6-GFX9. 3817 3818This section describes general syntax for instructions and operands. For more 3819information about instructions, their semantics and supported combinations of 3820operands, refer to one of instruction set architecture manuals 3821[AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_. 3822 3823An instruction has the following syntax (register operands are normally 3824comma-separated while extra operands are space-separated): 3825 3826*<opcode> <register_operand0>, ... <extra_operand0> ...* 3827 3828Operands 3829~~~~~~~~ 3830 3831The following syntax for register operands is supported: 3832 3833* SGPR registers: s0, ... or s[0], ... 3834* VGPR registers: v0, ... or v[0], ... 3835* TTMP registers: ttmp0, ... or ttmp[0], ... 3836* Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi) 3837* Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi) 3838* Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ... 3839* Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3] 3840* Register index expressions: v[2*2], s[1-1:2-1] 3841* 'off' indicates that an operand is not enabled 3842 3843The following extra operands are supported: 3844 3845* offset, offset0, offset1 3846* idxen, offen bits 3847* glc, slc, tfe bits 3848* waitcnt: integer or combination of counter values 3849* VOP3 modifiers: 3850 3851 - abs (\| \|), neg (\-) 3852 3853* DPP modifiers: 3854 3855 - row_shl, row_shr, row_ror, row_rol 3856 - row_mirror, row_half_mirror, row_bcast 3857 - wave_shl, wave_shr, wave_ror, wave_rol, quad_perm 3858 - row_mask, bank_mask, bound_ctrl 3859 3860* SDWA modifiers: 3861 3862 - dst_sel, src0_sel, src1_sel (BYTE_N, WORD_M, DWORD) 3863 - dst_unused (UNUSED_PAD, UNUSED_SEXT, UNUSED_PRESERVE) 3864 - abs, neg, sext 3865 3866Instruction Examples 3867~~~~~~~~~~~~~~~~~~~~ 3868 3869DS 3870~~ 3871 3872.. code-block:: nasm 3873 3874 ds_add_u32 v2, v4 offset:16 3875 ds_write_src2_b64 v2 offset0:4 offset1:8 3876 ds_cmpst_f32 v2, v4, v6 3877 ds_min_rtn_f64 v[8:9], v2, v[4:5] 3878 3879 3880For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual. 3881 3882FLAT 3883++++ 3884 3885.. code-block:: nasm 3886 3887 flat_load_dword v1, v[3:4] 3888 flat_store_dwordx3 v[3:4], v[5:7] 3889 flat_atomic_swap v1, v[3:4], v5 glc 3890 flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc 3891 flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc 3892 3893For full list of supported instructions, refer to "FLAT instructions" in ISA Manual. 3894 3895MUBUF 3896+++++ 3897 3898.. code-block:: nasm 3899 3900 buffer_load_dword v1, off, s[4:7], s1 3901 buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe 3902 buffer_store_format_xy v[1:2], off, s[4:7], s1 3903 buffer_wbinvl1 3904 buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc 3905 3906For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual. 3907 3908SMRD/SMEM 3909+++++++++ 3910 3911.. code-block:: nasm 3912 3913 s_load_dword s1, s[2:3], 0xfc 3914 s_load_dwordx8 s[8:15], s[2:3], s4 3915 s_load_dwordx16 s[88:103], s[2:3], s4 3916 s_dcache_inv_vol 3917 s_memtime s[4:5] 3918 3919For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual. 3920 3921SOP1 3922++++ 3923 3924.. code-block:: nasm 3925 3926 s_mov_b32 s1, s2 3927 s_mov_b64 s[0:1], 0x80000000 3928 s_cmov_b32 s1, 200 3929 s_wqm_b64 s[2:3], s[4:5] 3930 s_bcnt0_i32_b64 s1, s[2:3] 3931 s_swappc_b64 s[2:3], s[4:5] 3932 s_cbranch_join s[4:5] 3933 3934For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual. 3935 3936SOP2 3937++++ 3938 3939.. code-block:: nasm 3940 3941 s_add_u32 s1, s2, s3 3942 s_and_b64 s[2:3], s[4:5], s[6:7] 3943 s_cselect_b32 s1, s2, s3 3944 s_andn2_b32 s2, s4, s6 3945 s_lshr_b64 s[2:3], s[4:5], s6 3946 s_ashr_i32 s2, s4, s6 3947 s_bfm_b64 s[2:3], s4, s6 3948 s_bfe_i64 s[2:3], s[4:5], s6 3949 s_cbranch_g_fork s[4:5], s[6:7] 3950 3951For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual. 3952 3953SOPC 3954++++ 3955 3956.. code-block:: nasm 3957 3958 s_cmp_eq_i32 s1, s2 3959 s_bitcmp1_b32 s1, s2 3960 s_bitcmp0_b64 s[2:3], s4 3961 s_setvskip s3, s5 3962 3963For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual. 3964 3965SOPP 3966++++ 3967 3968.. code-block:: nasm 3969 3970 s_barrier 3971 s_nop 2 3972 s_endpgm 3973 s_waitcnt 0 ; Wait for all counters to be 0 3974 s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above 3975 s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1. 3976 s_sethalt 9 3977 s_sleep 10 3978 s_sendmsg 0x1 3979 s_sendmsg sendmsg(MSG_INTERRUPT) 3980 s_trap 1 3981 3982For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual. 3983 3984Unless otherwise mentioned, little verification is performed on the operands 3985of SOPP Instructions, so it is up to the programmer to be familiar with the 3986range or acceptable values. 3987 3988VALU 3989++++ 3990 3991For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA), 3992the assembler will automatically use optimal encoding based on its operands. 3993To force specific encoding, one can add a suffix to the opcode of the instruction: 3994 3995* _e32 for 32-bit VOP1/VOP2/VOPC 3996* _e64 for 64-bit VOP3 3997* _dpp for VOP_DPP 3998* _sdwa for VOP_SDWA 3999 4000VOP1/VOP2/VOP3/VOPC examples: 4001 4002.. code-block:: nasm 4003 4004 v_mov_b32 v1, v2 4005 v_mov_b32_e32 v1, v2 4006 v_nop 4007 v_cvt_f64_i32_e32 v[1:2], v2 4008 v_floor_f32_e32 v1, v2 4009 v_bfrev_b32_e32 v1, v2 4010 v_add_f32_e32 v1, v2, v3 4011 v_mul_i32_i24_e64 v1, v2, 3 4012 v_mul_i32_i24_e32 v1, -3, v3 4013 v_mul_i32_i24_e32 v1, -100, v3 4014 v_addc_u32 v1, s[0:1], v2, v3, s[2:3] 4015 v_max_f16_e32 v1, v2, v3 4016 4017VOP_DPP examples: 4018 4019.. code-block:: nasm 4020 4021 v_mov_b32 v0, v0 quad_perm:[0,2,1,1] 4022 v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0 4023 v_mov_b32 v0, v0 wave_shl:1 4024 v_mov_b32 v0, v0 row_mirror 4025 v_mov_b32 v0, v0 row_bcast:31 4026 v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0 4027 v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0 4028 v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0 4029 4030VOP_SDWA examples: 4031 4032.. code-block:: nasm 4033 4034 v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD 4035 v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD 4036 v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1 4037 v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1 4038 v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0 4039 4040For full list of supported instructions, refer to "Vector ALU instructions". 4041 4042HSA Code Object Directives 4043~~~~~~~~~~~~~~~~~~~~~~~~~~ 4044 4045AMDGPU ABI defines auxiliary data in output code object. In assembly source, 4046one can specify them with assembler directives. 4047 4048.hsa_code_object_version major, minor 4049+++++++++++++++++++++++++++++++++++++ 4050 4051*major* and *minor* are integers that specify the version of the HSA code 4052object that will be generated by the assembler. 4053 4054.hsa_code_object_isa [major, minor, stepping, vendor, arch] 4055+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 4056 4057 4058*major*, *minor*, and *stepping* are all integers that describe the instruction 4059set architecture (ISA) version of the assembly program. 4060 4061*vendor* and *arch* are quoted strings. *vendor* should always be equal to 4062"AMD" and *arch* should always be equal to "AMDGPU". 4063 4064By default, the assembler will derive the ISA version, *vendor*, and *arch* 4065from the value of the -mcpu option that is passed to the assembler. 4066 4067.amdgpu_hsa_kernel (name) 4068+++++++++++++++++++++++++ 4069 4070This directives specifies that the symbol with given name is a kernel entry point 4071(label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL. 4072 4073.amd_kernel_code_t 4074++++++++++++++++++ 4075 4076This directive marks the beginning of a list of key / value pairs that are used 4077to specify the amd_kernel_code_t object that will be emitted by the assembler. 4078The list must be terminated by the *.end_amd_kernel_code_t* directive. For 4079any amd_kernel_code_t values that are unspecified a default value will be 4080used. The default value for all keys is 0, with the following exceptions: 4081 4082- *kernel_code_version_major* defaults to 1. 4083- *machine_kind* defaults to 1. 4084- *machine_version_major*, *machine_version_minor*, and 4085 *machine_version_stepping* are derived from the value of the -mcpu option 4086 that is passed to the assembler. 4087- *kernel_code_entry_byte_offset* defaults to 256. 4088- *wavefront_size* defaults to 6. 4089- *kernarg_segment_alignment*, *group_segment_alignment*, and 4090 *private_segment_alignment* default to 4. Note that alignments are specified 4091 as a power of two, so a value of **n** means an alignment of 2^ **n**. 4092 4093The *.amd_kernel_code_t* directive must be placed immediately after the 4094function label and before any instructions. 4095 4096For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document, 4097comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s. 4098 4099Here is an example of a minimal amd_kernel_code_t specification: 4100 4101.. code-block:: none 4102 4103 .hsa_code_object_version 1,0 4104 .hsa_code_object_isa 4105 4106 .hsatext 4107 .globl hello_world 4108 .p2align 8 4109 .amdgpu_hsa_kernel hello_world 4110 4111 hello_world: 4112 4113 .amd_kernel_code_t 4114 enable_sgpr_kernarg_segment_ptr = 1 4115 is_ptr64 = 1 4116 compute_pgm_rsrc1_vgprs = 0 4117 compute_pgm_rsrc1_sgprs = 0 4118 compute_pgm_rsrc2_user_sgpr = 2 4119 kernarg_segment_byte_size = 8 4120 wavefront_sgpr_count = 2 4121 workitem_vgpr_count = 3 4122 .end_amd_kernel_code_t 4123 4124 s_load_dwordx2 s[0:1], s[0:1] 0x0 4125 v_mov_b32 v0, 3.14159 4126 s_waitcnt lgkmcnt(0) 4127 v_mov_b32 v1, s0 4128 v_mov_b32 v2, s1 4129 flat_store_dword v[1:2], v0 4130 s_endpgm 4131 .Lfunc_end0: 4132 .size hello_world, .Lfunc_end0-hello_world 4133 4134Additional Documentation 4135======================== 4136 4137.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__ 4138.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__ 4139.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__ 4140.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__ 4141.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__ 4142.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_ 4143.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__ 4144.. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__ 4145.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__ 4146.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__ 4147.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__ 4148.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__ 4149.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__ 4150.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__ 4151.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__ 4152.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__ 4153