==============================
User Guide for AMDGPU Back-end
==============================

Introduction
============

The AMDGPU back-end provides ISA code generation for AMD GPUs, starting with
the R600 family up until the current Volcanic Islands (GCN Gen 3).

Refer to `AMDGPU section in Architecture & Platform Information for Compiler Writers <CompilerWriterInfo.html#amdgpu>`_
for additional documentation.

Conventions
===========

Address Spaces
--------------

The AMDGPU back-end uses the following address space mapping:

   ============= ============================================
   Address Space Memory Space
   ============= ============================================
   0             Private
   1             Global
   2             Constant
   3             Local
   4             Generic (Flat)
   5             Region
   ============= ============================================

The terminology in the table, aside from the region memory space, is from the
OpenCL standard.


Assembler
=========

The AMDGPU back-end has an LLVM-MC based assembler, which is currently in
development. It supports the Southern Islands, Sea Islands, and Volcanic
Islands ISAs.

This document describes the general syntax for instructions and operands. For
more information about instructions, their semantics, and supported
combinations of operands, refer to one of the Instruction Set Architecture
manuals.

An instruction has the following syntax (register operands are
normally comma-separated while extra operands are space-separated):

*<opcode> <register_operand0>, ... <extra_operand0> ...*


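As an illustration of the syntax above, the following Python sketch joins an
opcode with its operands. The helper name ``format_instruction`` is
hypothetical and is not part of LLVM; it only mirrors the separator convention
described here.

.. code-block:: python

  def format_instruction(opcode, register_operands=(), extra_operands=()):
      """Build one line of assembly text.

      Register operands are joined with ", " and extra operands are
      appended separated by single spaces.
      """
      parts = [opcode]
      if register_operands:
          parts.append(", ".join(register_operands))
      parts.extend(extra_operands)
      return " ".join(parts)

  print(format_instruction("ds_add_u32", ["v2", "v4"], ["offset:16"]))
  print(format_instruction("buffer_wbinvl1"))

The two calls reproduce the forms ``ds_add_u32 v2, v4 offset:16`` and
``buffer_wbinvl1`` that appear in the instruction examples in this document.
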
Operands
--------

The following syntax for register operands is supported:

* SGPR registers: s0, ... or s[0], ...
* VGPR registers: v0, ... or v[0], ...
* TTMP registers: ttmp0, ... or ttmp[0], ...
* Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
* Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
* Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
* Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
* Register index expressions: v[2*2], s[1-1:2-1]
* 'off' indicates that an operand is not enabled

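The register forms listed above can be normalized to a register kind and an
inclusive index range. The following Python sketch is illustrative only:
``parse_register`` is a hypothetical helper, not the assembler's parser, and it
assumes that index expressions use only integer literals with ``+``, ``-``,
and ``*``. Special registers, register lists, and ``off`` are not handled.

.. code-block:: python

  import ast
  import operator
  import re

  # Binary operators accepted in index expressions (an assumption of this sketch).
  _OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}

  # Matches s0 / v[4] / ttmp[4:7]; the bracketed parts may be expressions.
  _REG = re.compile(r"^(s|v|ttmp)(?:(\d+)|\[([^\]:]+)(?::([^\]]+))?\])$")

  def _eval_index(text):
      """Evaluate a constant integer expression such as ``2*2``."""
      def walk(node):
          if isinstance(node, ast.Constant) and type(node.value) is int:
              return node.value
          if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
              return _OPS[type(node.op)](walk(node.left), walk(node.right))
          raise ValueError(f"unsupported index expression: {text!r}")
      return walk(ast.parse(text.strip(), mode="eval").body)

  def parse_register(operand):
      """Return (kind, first_index, last_index) for an SGPR/VGPR/TTMP operand."""
      match = _REG.match(operand.strip())
      if match is None:
          raise ValueError(f"not a plain register operand: {operand!r}")
      kind, plain, first, last = match.groups()
      if plain is not None:
          lo = hi = int(plain)
      else:
          lo = _eval_index(first)
          hi = _eval_index(last) if last is not None else lo
      if hi < lo:
          raise ValueError(f"empty register range: {operand!r}")
      return kind, lo, hi

  print(parse_register("s[1-1:2-1]"))
  print(parse_register("v[2*2]"))

With these assumptions, ``s[1-1:2-1]`` names the same registers as ``s[0:1]``,
and ``v[2*2]`` names the same register as ``v4``.
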
The following extra operands are supported:

* offset, offset0, offset1
* idxen, offen bits
* glc, slc, tfe bits
* waitcnt: integer or combination of counter values
* VOP3 modifiers:

  - abs (\| \|), neg (\-)

* DPP modifiers:

  - row_shl, row_shr, row_ror, row_rol
  - row_mirror, row_half_mirror, row_bcast
  - wave_shl, wave_shr, wave_ror, wave_rol, quad_perm
  - row_mask, bank_mask, bound_ctrl

* SDWA modifiers:

  - dst_sel, src0_sel, src1_sel (BYTE_N, WORD_M, DWORD)
  - dst_unused (UNUSED_PAD, UNUSED_SEXT, UNUSED_PRESERVE)
  - abs, neg, sext

DS Instruction Examples
-----------------------

.. code-block:: nasm

  ds_add_u32 v2, v4 offset:16
  ds_write_src2_b64 v2 offset0:4 offset1:8
  ds_cmpst_f32 v2, v4, v6
  ds_min_rtn_f64 v[8:9], v2, v[4:5]


For a full list of supported instructions, refer to "LDS/GDS instructions" in the ISA manual.

FLAT Instruction Examples
--------------------------

.. code-block:: nasm

  flat_load_dword v1, v[3:4]
  flat_store_dwordx3 v[3:4], v[5:7]
  flat_atomic_swap v1, v[3:4], v5 glc
  flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
  flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc

For a full list of supported instructions, refer to "FLAT instructions" in the ISA manual.

MUBUF Instruction Examples
---------------------------

.. code-block:: nasm

  buffer_load_dword v1, off, s[4:7], s1
  buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
  buffer_store_format_xy v[1:2], off, s[4:7], s1
  buffer_wbinvl1
  buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc

For a full list of supported instructions, refer to "MUBUF Instructions" in the ISA manual.

SMRD/SMEM Instruction Examples
-------------------------------

.. code-block:: nasm

  s_load_dword s1, s[2:3], 0xfc
  s_load_dwordx8 s[8:15], s[2:3], s4
  s_load_dwordx16 s[88:103], s[2:3], s4
  s_dcache_inv_vol
  s_memtime s[4:5]

For a full list of supported instructions, refer to "Scalar Memory Operations" in the ISA manual.

SOP1 Instruction Examples
--------------------------

.. code-block:: nasm

  s_mov_b32 s1, s2
  s_mov_b64 s[0:1], 0x80000000
  s_cmov_b32 s1, 200
  s_wqm_b64 s[2:3], s[4:5]
  s_bcnt0_i32_b64 s1, s[2:3]
  s_swappc_b64 s[2:3], s[4:5]
  s_cbranch_join s[4:5]

For a full list of supported instructions, refer to "SOP1 Instructions" in the ISA manual.

SOP2 Instruction Examples
-------------------------

.. code-block:: nasm

  s_add_u32 s1, s2, s3
  s_and_b64 s[2:3], s[4:5], s[6:7]
  s_cselect_b32 s1, s2, s3
  s_andn2_b32 s2, s4, s6
  s_lshr_b64 s[2:3], s[4:5], s6
  s_ashr_i32 s2, s4, s6
  s_bfm_b64 s[2:3], s4, s6
  s_bfe_i64 s[2:3], s[4:5], s6
  s_cbranch_g_fork s[4:5], s[6:7]

For a full list of supported instructions, refer to "SOP2 Instructions" in the ISA manual.

SOPC Instruction Examples
--------------------------

.. code-block:: nasm

  s_cmp_eq_i32 s1, s2
  s_bitcmp1_b32 s1, s2
  s_bitcmp0_b64 s[2:3], s4
  s_setvskip s3, s5

For a full list of supported instructions, refer to "SOPC Instructions" in the ISA manual.

SOPP Instruction Examples
--------------------------

.. code-block:: nasm

  s_barrier
  s_nop 2
  s_endpgm
  s_waitcnt 0 ; Wait for all counters to be 0
  s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
  s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
  s_sethalt 9
  s_sleep 10
  s_sendmsg 0x1
  s_sendmsg sendmsg(MSG_INTERRUPT)
  s_trap 1

For a full list of supported instructions, refer to "SOPP Instructions" in the ISA manual.

Unless otherwise mentioned, little verification is performed on the operands
of SOPP instructions, so it is up to the programmer to be familiar with the
range of acceptable values.

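The relationship between the ``s_waitcnt`` forms in the SOPP examples can be
sketched in Python. The field layout used here (vmcnt in bits 3:0, expcnt in
bits 6:4, lgkmcnt in bits 11:8 of the 16-bit immediate) is an assumption for
the targets listed in this document and should be checked against the ISA
manual. ``encode_waitcnt`` is a hypothetical helper, not an LLVM API.

.. code-block:: python

  # Assumed layout: counter name -> (bit shift, bit width) in the immediate.
  FIELDS = {"vmcnt": (0, 4), "expcnt": (4, 3), "lgkmcnt": (8, 4)}

  def encode_waitcnt(**counters):
      """Pack counter values into one immediate.

      A counter that is not named keeps its maximum value, which means the
      instruction does not wait on it.
      """
      unknown = set(counters) - set(FIELDS)
      if unknown:
          raise ValueError(f"unknown counters: {sorted(unknown)}")
      value = 0
      for name, (shift, width) in FIELDS.items():
          max_count = (1 << width) - 1
          count = counters.get(name, max_count)
          if not 0 <= count <= max_count:
              raise ValueError(f"{name} out of range: {count}")
          value |= count << shift
      return value

  print(hex(encode_waitcnt(vmcnt=0, expcnt=0, lgkmcnt=0)))
  print(hex(encode_waitcnt(vmcnt=1)))

Under this assumed layout, naming all three counters with a value of 0 yields
the immediate 0, which is why the first two ``s_waitcnt`` forms are equivalent,
while ``vmcnt(1)`` alone leaves the other counters at their maximum value.
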
Vector ALU Instruction Examples
-------------------------------

For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
the assembler automatically uses the optimal encoding based on the instruction's
operands. To force a specific encoding, add a suffix to the opcode of the
instruction:

* _e32 for 32-bit VOP1/VOP2/VOPC
* _e64 for 64-bit VOP3
* _dpp for VOP_DPP
* _sdwa for VOP_SDWA

VOP1/VOP2/VOP3/VOPC examples:

.. code-block:: nasm

  v_mov_b32 v1, v2
  v_mov_b32_e32 v1, v2
  v_nop
  v_cvt_f64_i32_e32 v[1:2], v2
  v_floor_f32_e32 v1, v2
  v_bfrev_b32_e32 v1, v2
  v_add_f32_e32 v1, v2, v3
  v_mul_i32_i24_e64 v1, v2, 3
  v_mul_i32_i24_e32 v1, -3, v3
  v_mul_i32_i24_e32 v1, -100, v3
  v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
  v_max_f16_e32 v1, v2, v3

VOP_DPP examples:

.. code-block:: nasm

  v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
  v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_mov_b32 v0, v0 wave_shl:1
  v_mov_b32 v0, v0 row_mirror
  v_mov_b32 v0, v0 row_bcast:31
  v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0

VOP_SDWA examples:

.. code-block:: nasm

  v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
  v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
  v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
  v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
  v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0

For a full list of supported instructions, refer to "Vector ALU instructions" in the ISA manual.

HSA Code Object Directives
--------------------------

The AMDGPU ABI defines auxiliary data in the output code object. In assembly
source, this data can be specified with assembler directives.

.hsa_code_object_version major, minor
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

*major* and *minor* are integers that specify the version of the HSA code
object that will be generated by the assembler.

.hsa_code_object_isa [major, minor, stepping, vendor, arch]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

*major*, *minor*, and *stepping* are all integers that describe the instruction
set architecture (ISA) version of the assembly program.

*vendor* and *arch* are quoted strings.  *vendor* should always be equal to
"AMD" and *arch* should always be equal to "AMDGPU".

By default, the assembler will derive the ISA version, *vendor*, and *arch*
from the value of the -mcpu option that is passed to the assembler.

.amdgpu_hsa_kernel (name)
^^^^^^^^^^^^^^^^^^^^^^^^^

This directive specifies that the symbol with the given name is a kernel entry
point (label) and that the object should contain a corresponding symbol of type
STT_AMDGPU_HSA_KERNEL.

.amd_kernel_code_t
^^^^^^^^^^^^^^^^^^

This directive marks the beginning of a list of key / value pairs that are used
to specify the amd_kernel_code_t object that will be emitted by the assembler.
The list must be terminated by the *.end_amd_kernel_code_t* directive.  For
any amd_kernel_code_t values that are unspecified, a default value will be
used.  The default value for all keys is 0, with the following exceptions:

- *kernel_code_version_major* defaults to 1.
- *machine_kind* defaults to 1.
- *machine_version_major*, *machine_version_minor*, and
  *machine_version_stepping* are derived from the value of the -mcpu option
  that is passed to the assembler.
- *kernel_code_entry_byte_offset* defaults to 256.
- *wavefront_size* defaults to 6.
- *kernarg_segment_alignment*, *group_segment_alignment*, and
  *private_segment_alignment* default to 4.  Note that alignments are specified
  as a power of two, so a value of **n** means an alignment of 2^ **n**.

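The default rules above can be summarized in a short Python sketch. It is
illustrative only: ``DEFAULTS``, ``effective_value``, and
``alignment_from_log2`` are hypothetical names, not part of the assembler, and
the keys derived from -mcpu are deliberately left without a fixed value.

.. code-block:: python

  # Documented non-zero defaults; every other key defaults to 0.
  DEFAULTS = {
      "kernel_code_version_major": 1,
      "machine_kind": 1,
      "kernel_code_entry_byte_offset": 256,
      "wavefront_size": 6,
      "kernarg_segment_alignment": 4,
      "group_segment_alignment": 4,
      "private_segment_alignment": 4,
  }

  # These keys depend on the -mcpu option, so no constant default exists.
  MCPU_DERIVED = {
      "machine_version_major",
      "machine_version_minor",
      "machine_version_stepping",
  }

  def effective_value(key, overrides):
      """Return the explicit value for a key, or its documented default."""
      if key in overrides:
          return overrides[key]
      if key in MCPU_DERIVED:
          raise KeyError(f"{key} is derived from -mcpu, not a fixed default")
      return DEFAULTS.get(key, 0)

  def alignment_from_log2(n):
      """Convert an alignment key value n into the alignment 2^n."""
      return 1 << n

  overrides = {"is_ptr64": 1, "kernarg_segment_byte_size": 8}
  print(effective_value("kernarg_segment_byte_size", overrides))
  print(alignment_from_log2(effective_value("kernarg_segment_alignment", overrides)))

In this sketch, a key that is neither overridden nor listed among the
exceptions resolves to 0, and the default alignment value of 4 corresponds to
an alignment of 2^4 = 16.
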
The *.amd_kernel_code_t* directive must be placed immediately after the
function label and before any instructions.

For a full list of amd_kernel_code_t keys, refer to the AMDGPU ABI document,
the comments in lib/Target/AMDGPU/AmdKernelCodeT.h, and
test/CodeGen/AMDGPU/hsa.s.

Here is an example of a minimal amd_kernel_code_t specification:

.. code-block:: none

   .hsa_code_object_version 1,0
   .hsa_code_object_isa

   .hsatext
   .globl  hello_world
   .p2align 8
   .amdgpu_hsa_kernel hello_world

   hello_world:

      .amd_kernel_code_t
         enable_sgpr_kernarg_segment_ptr = 1
         is_ptr64 = 1
         compute_pgm_rsrc1_vgprs = 0
         compute_pgm_rsrc1_sgprs = 0
         compute_pgm_rsrc2_user_sgpr = 2
         kernarg_segment_byte_size = 8
         wavefront_sgpr_count = 2
         workitem_vgpr_count = 3
      .end_amd_kernel_code_t

      s_load_dwordx2 s[0:1], s[0:1], 0x0
      v_mov_b32 v0, 3.14159
      s_waitcnt lgkmcnt(0)
      v_mov_b32 v1, s0
      v_mov_b32 v2, s1
      flat_store_dword v[1:2], v0
      s_endpgm
   .Lfunc_end0:
      .size   hello_world, .Lfunc_end0-hello_world
