==============================
User Guide for AMDGPU Back-end
==============================

Introduction
============

The AMDGPU back-end provides ISA code generation for AMD GPUs, from the R600
family through the current Volcanic Islands (GCN Gen 3).

Refer to `AMDGPU section in Architecture & Platform Information for Compiler Writers <CompilerWriterInfo.html#amdgpu>`_
for additional documentation.

Conventions
===========

Address Spaces
--------------

The AMDGPU back-end uses the following address space mapping:

  ============= ============================================
  Address Space Memory Space
  ============= ============================================
  0             Private
  1             Global
  2             Constant
  3             Local
  4             Generic (Flat)
  5             Region
  ============= ============================================

The terminology in the table, aside from the region memory space, is from the
OpenCL standard.


Assembler
=========

The AMDGPU back-end has an LLVM-MC based assembler, which is currently in
development. It supports the Southern Islands, Sea Islands, and Volcanic
Islands ISAs.

This document describes the general syntax for instructions and operands. For
more information about instructions, their semantics, and supported
combinations of operands, refer to one of the Instruction Set Architecture
manuals.

An instruction has the following syntax (register operands are
normally comma-separated while extra operands are space-separated):

*<opcode> <register_operand0>, ... <extra_operand0> ...*


Operands
--------

The following syntax for register operands is supported:

* SGPR registers: s0, ... or s[0], ...
* VGPR registers: v0, ... or v[0], ...
* TTMP registers: ttmp0, ... or ttmp[0], ...
* Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
* Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
* Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
* Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
* Register index expressions: v[2*2], s[1-1:2-1]
* 'off' indicates that an operand is not enabled

The following extra operands are supported:

* offset, offset0, offset1
* idxen, offen bits
* glc, slc, tfe bits
* waitcnt: integer or combination of counter values
* VOP3 modifiers:

  - abs (\| \|), neg (\-)

* DPP modifiers:

  - row_shl, row_shr, row_ror, row_rol
  - row_mirror, row_half_mirror, row_bcast
  - wave_shl, wave_shr, wave_ror, wave_rol, quad_perm
  - row_mask, bank_mask, bound_ctrl

* SDWA modifiers:

  - dst_sel, src0_sel, src1_sel (BYTE_N, WORD_M, DWORD)
  - dst_unused (UNUSED_PAD, UNUSED_SEXT, UNUSED_PRESERVE)
  - abs, neg, sext

DS Instruction Examples
-----------------------

.. code-block:: nasm

  ds_add_u32 v2, v4 offset:16
  ds_write_src2_b64 v2 offset0:4 offset1:8
  ds_cmpst_f32 v2, v4, v6
  ds_min_rtn_f64 v[8:9], v2, v[4:5]


For the full list of supported instructions, refer to "LDS/GDS instructions" in the ISA Manual.

FLAT Instruction Examples
-------------------------

.. code-block:: nasm

  flat_load_dword v1, v[3:4]
  flat_store_dwordx3 v[3:4], v[5:7]
  flat_atomic_swap v1, v[3:4], v5 glc
  flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
  flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc

For the full list of supported instructions, refer to "FLAT instructions" in the ISA Manual.

MUBUF Instruction Examples
--------------------------

.. code-block:: nasm

  buffer_load_dword v1, off, s[4:7], s1
  buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
  buffer_store_format_xy v[1:2], off, s[4:7], s1
  buffer_wbinvl1
  buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc

For the full list of supported instructions, refer to "MUBUF Instructions" in the ISA Manual.

SMRD/SMEM Instruction Examples
------------------------------

.. code-block:: nasm

  s_load_dword s1, s[2:3], 0xfc
  s_load_dwordx8 s[8:15], s[2:3], s4
  s_load_dwordx16 s[88:103], s[2:3], s4
  s_dcache_inv_vol
  s_memtime s[4:5]

For the full list of supported instructions, refer to "Scalar Memory Operations" in the ISA Manual.

SOP1 Instruction Examples
-------------------------

.. code-block:: nasm

  s_mov_b32 s1, s2
  s_mov_b64 s[0:1], 0x80000000
  s_cmov_b32 s1, 200
  s_wqm_b64 s[2:3], s[4:5]
  s_bcnt0_i32_b64 s1, s[2:3]
  s_swappc_b64 s[2:3], s[4:5]
  s_cbranch_join s[4:5]

For the full list of supported instructions, refer to "SOP1 Instructions" in the ISA Manual.

SOP2 Instruction Examples
-------------------------

.. code-block:: nasm

  s_add_u32 s1, s2, s3
  s_and_b64 s[2:3], s[4:5], s[6:7]
  s_cselect_b32 s1, s2, s3
  s_andn2_b32 s2, s4, s6
  s_lshr_b64 s[2:3], s[4:5], s6
  s_ashr_i32 s2, s4, s6
  s_bfm_b64 s[2:3], s4, s6
  s_bfe_i64 s[2:3], s[4:5], s6
  s_cbranch_g_fork s[4:5], s[6:7]

For the full list of supported instructions, refer to "SOP2 Instructions" in the ISA Manual.

SOPC Instruction Examples
-------------------------

.. code-block:: nasm

  s_cmp_eq_i32 s1, s2
  s_bitcmp1_b32 s1, s2
  s_bitcmp0_b64 s[2:3], s4
  s_setvskip s3, s5

For the full list of supported instructions, refer to "SOPC Instructions" in the ISA Manual.

SOPP Instruction Examples
-------------------------

.. code-block:: nasm

  s_barrier
  s_nop 2
  s_endpgm
  s_waitcnt 0 ; Wait for all counters to be 0
  s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
  s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
  s_sethalt 9
  s_sleep 10
  s_sendmsg 0x1
  s_sendmsg sendmsg(MSG_INTERRUPT)
  s_trap 1

For the full list of supported instructions, refer to "SOPP Instructions" in the ISA Manual.

Unless otherwise mentioned, little verification is performed on the operands
of SOPP instructions, so it is up to the programmer to be familiar with the
range of acceptable values.

Vector ALU Instruction Examples
-------------------------------

For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
the assembler automatically uses the optimal encoding based on the
instruction's operands. To force a specific encoding, add a suffix to the
opcode of the instruction:

* _e32 for 32-bit VOP1/VOP2/VOPC
* _e64 for 64-bit VOP3
* _dpp for VOP_DPP
* _sdwa for VOP_SDWA

VOP1/VOP2/VOP3/VOPC examples:

.. code-block:: nasm

  v_mov_b32 v1, v2
  v_mov_b32_e32 v1, v2
  v_nop
  v_cvt_f64_i32_e32 v[1:2], v2
  v_floor_f32_e32 v1, v2
  v_bfrev_b32_e32 v1, v2
  v_add_f32_e32 v1, v2, v3
  v_mul_i32_i24_e64 v1, v2, 3
  v_mul_i32_i24_e32 v1, -3, v3
  v_mul_i32_i24_e32 v1, -100, v3
  v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
  v_max_f16_e32 v1, v2, v3

VOP_DPP examples:

.. code-block:: nasm

  v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
  v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_mov_b32 v0, v0 wave_shl:1
  v_mov_b32 v0, v0 row_mirror
  v_mov_b32 v0, v0 row_bcast:31
  v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0

VOP_SDWA examples:

.. code-block:: nasm

  v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
  v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
  v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
  v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
  v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0

For the full list of supported instructions, refer to "Vector ALU instructions" in the ISA Manual.

HSA Code Object Directives
--------------------------

The AMDGPU ABI defines auxiliary data in the output code object. In assembly
source, this data can be specified with assembler directives.

.hsa_code_object_version major, minor
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

*major* and *minor* are integers that specify the version of the HSA code
object that will be generated by the assembler.

.hsa_code_object_isa [major, minor, stepping, vendor, arch]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

*major*, *minor*, and *stepping* are all integers that describe the instruction
set architecture (ISA) version of the assembly program.

*vendor* and *arch* are quoted strings. *vendor* should always be equal to
"AMD" and *arch* should always be equal to "AMDGPU".

By default, the assembler will derive the ISA version, *vendor*, and *arch*
from the value of the -mcpu option that is passed to the assembler.
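
As an illustration of the two directives described above, a source file that
states its code object version and ISA explicitly might begin as follows. This
is only a sketch: the ISA version 7,0,0 is an assumed example value that does
not come from this guide and should match the GPU actually targeted, while the
quoted strings are the fixed values given above.

.. code-block:: none

  ; Assumed example values: code object version 1.0, ISA version 7.0.0.
  .hsa_code_object_version 1,0
  .hsa_code_object_isa 7,0,0,"AMD","AMDGPU"

Writing *.hsa_code_object_isa* with no operands leaves the ISA version,
*vendor*, and *arch* to be derived from the -mcpu option.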

.amdgpu_hsa_kernel (name)
^^^^^^^^^^^^^^^^^^^^^^^^^

This directive specifies that the symbol with the given name is a kernel entry
point (label) and that the object should contain a corresponding symbol of type
STT_AMDGPU_HSA_KERNEL.

.amd_kernel_code_t
^^^^^^^^^^^^^^^^^^

This directive marks the beginning of a list of key / value pairs that are used
to specify the amd_kernel_code_t object that will be emitted by the assembler.
The list must be terminated by the *.end_amd_kernel_code_t* directive. For
any amd_kernel_code_t values that are unspecified, a default value will be
used. The default value for all keys is 0, with the following exceptions:

- *kernel_code_version_major* defaults to 1.
- *machine_kind* defaults to 1.
- *machine_version_major*, *machine_version_minor*, and
  *machine_version_stepping* are derived from the value of the -mcpu option
  that is passed to the assembler.
- *kernel_code_entry_byte_offset* defaults to 256.
- *wavefront_size* defaults to 6.
- *kernarg_segment_alignment*, *group_segment_alignment*, and
  *private_segment_alignment* default to 4. Note that alignments are specified
  as a power of two, so a value of **n** means an alignment of 2^ **n**.

The *.amd_kernel_code_t* directive must be placed immediately after the
function label and before any instructions.

For a full list of amd_kernel_code_t keys, refer to the AMDGPU ABI document,
the comments in lib/Target/AMDGPU/AmdKernelCodeT.h, and
test/CodeGen/AMDGPU/hsa.s.

Here is an example of a minimal amd_kernel_code_t specification:

.. code-block:: none

   .hsa_code_object_version 1,0
   .hsa_code_object_isa

   .hsatext
   .globl  hello_world
   .p2align 8
   .amdgpu_hsa_kernel hello_world

   hello_world:

      .amd_kernel_code_t
         enable_sgpr_kernarg_segment_ptr = 1
         is_ptr64 = 1
         compute_pgm_rsrc1_vgprs = 0
         compute_pgm_rsrc1_sgprs = 0
         compute_pgm_rsrc2_user_sgpr = 2
         kernarg_segment_byte_size = 8
         wavefront_sgpr_count = 2
         workitem_vgpr_count = 3
      .end_amd_kernel_code_t

      s_load_dwordx2 s[0:1], s[0:1], 0x0
      v_mov_b32 v0, 3.14159
      s_waitcnt lgkmcnt(0)
      v_mov_b32 v1, s0
      v_mov_b32 v2, s1
      flat_store_dword v[1:2], v0
      s_endpgm
   .Lfunc_end0:
      .size   hello_world, .Lfunc_end0-hello_world
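
The power-of-two convention for the alignment keys can be illustrated with a
fragment that overrides one of the defaults. This is only a sketch: the value 6
is an arbitrary choice made for illustration, and it requests a kernarg segment
alignment of 2^6 = 64 instead of the default 2^4 = 16. All other keys keep
their default values.

.. code-block:: none

   ; Illustrative value only: 6 means an alignment of 2^6 = 64.
   .amd_kernel_code_t
      kernarg_segment_alignment = 6
   .end_amd_kernel_code_t

As with any *.amd_kernel_code_t* block, this fragment would sit immediately
after the kernel's function label and before its first instruction.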