1# Allow Location Descriptions on the DWARF Expression Stack <!-- omit in toc -->
2
3- [Extension](#extension)
4- [Heterogeneous Computing Devices](#heterogeneous-computing-devices)
5- [DWARF 5](#dwarf-5)
6  - [How DWARF Maps Source Language To Hardware](#how-dwarf-maps-source-language-to-hardware)
7  - [Examples](#examples)
8    - [Dynamic Array Size](#dynamic-array-size)
9    - [Variable Location in Register](#variable-location-in-register)
10    - [Variable Location in Memory](#variable-location-in-memory)
11    - [Variable Spread Across Different Locations](#variable-spread-across-different-locations)
12    - [Offsetting a Composite Location](#offsetting-a-composite-location)
13  - [Limitations](#limitations)
14- [Extension Solution](#extension-solution)
15  - [Location Description](#location-description)
16  - [Stack Location Description Operations](#stack-location-description-operations)
17  - [Examples](#examples-1)
18    - [Source Language Variable Spilled to Part of a Vector Register](#source-language-variable-spilled-to-part-of-a-vector-register)
19    - [Source Language Variable Spread Across Multiple Vector Registers](#source-language-variable-spread-across-multiple-vector-registers)
20    - [Source Language Variable Spread Across Multiple Kinds of Locations](#source-language-variable-spread-across-multiple-kinds-of-locations)
21    - [Address Spaces](#address-spaces)
22    - [Bit Offsets](#bit-offsets)
23  - [Call Frame Information (CFI)](#call-frame-information-cfi)
24  - [Objects Not In Byte Aligned Global Memory](#objects-not-in-byte-aligned-global-memory)
25  - [Higher Order Operations](#higher-order-operations)
26  - [Objects In Multiple Places](#objects-in-multiple-places)
27- [Conclusion](#conclusion)
28- [Further Information](#further-information)
29
30# Extension
31
32In DWARF 5, expressions are evaluated using a typed value stack, a separate
33location area, and an independent loclist mechanism. This extension unifies all
34three mechanisms into a single generalized DWARF expression evaluation model
35that allows both typed values and location descriptions to be manipulated on the
36evaluation stack. Both single and multiple location descriptions are supported
37on the stack. In addition, the call frame information (CFI) is extended to
38support the full generality of location descriptions. This is done in a manner
39that is backwards compatible with DWARF 5. The extension involves changes to the
40DWARF 5 sections 2.5 (pp 26-38), 2.6 (pp 38-45), and 6.4 (pp 171-182).
41
42The extension permits operations to act on location descriptions in an
43incremental, consistent, and composable manner. It allows a small number of
44operations to be defined to address the requirements of heterogeneous devices as
45well as providing benefits to non-heterogeneous devices. It acts as a foundation
46to provide support for other issues that have been raised that would benefit all
47devices.
48
49Other approaches were explored that involved adding specialized operations and
50rules. However, these resulted in the need for more operations that did not
51compose. It also resulted in operations with context sensitive semantics and
52corner cases that had to be defined. The observation was that numerous
53specialized context sensitive operations are harder for both produces and
54consumers than a smaller number of general composable operations that have
55consistent semantics regardless of context.
56
57The following sections first describe heterogeneous devices and the features
58they have that are not addressed by DWARF 5. Then a brief simplified overview of
59the DWARF 5 expression evaluation model is presented that highlights the
60difficulties for supporting the heterogeneous features. Finally, an overview of
61the extension is presented, using simplified examples to illustrate how it can
62address the issues of heterogeneous devices and also benefit non-heterogeneous
63devices. References to further information are provided.
64
65# Heterogeneous Computing Devices
66
67GPUs and other heterogeneous computing devices have features not common to CPU
68computing devices.
69
70These devices often have many more registers than a CPU. This helps reduce
71memory accesses which tend to be more expensive than on a CPU due to the much
72larger number of threads concurrently executing. In addition to traditional
73scalar registers of a CPU, these devices often have many wide vector registers.
74
75![Example GPU Hardware](images/example-gpu-hardware.png)
76
77They may support masked vector instructions that are used by the compiler to map
78high level language threads onto the lanes of the vector registers. As a
79consequence, multiple language threads execute in lockstep as the vector
80instructions are executed. This is termed single instruction multiple thread
81(SIMT) execution.
82
83![SIMT/SIMD Execution Model](images/simt-execution-model.png)
84
85GPUs can have multiple memory address spaces in addition to the single global
86memory address space of a CPU. These additional address spaces are accessed
87using distinct instructions and are often local to a particular thread or group
88of threads.
89
90For example, a GPU may have a per thread block address space that is implemented
91as scratch pad memory with explicit hardware support to isolate portions to
92specific groups of threads created as a single thread block.
93
94A GPU may also use global memory in a non linear manner. For example, to support
95providing a SIMT per lane address space efficiently, there may be instructions
96that support interleaved access.
97
98Through optimization, the source variables may be located across these different
99storage kinds. SIMT execution requires locations to be able to express selection
100of runtime defined pieces of vector registers. With the more complex locations,
101there is a benefit to be able to factorize their calculation which requires all
102location kinds to be supported uniformly, otherwise duplication is necessary.
103
104# DWARF 5
105
106Before presenting the proposed solution to supporting heterogeneous devices, a
107brief overview of the DWARF 5 expression evaluation model will be given to
108highlight the aspects being addressed by the extension.
109
110## How DWARF Maps Source Language To Hardware
111
112DWARF is a standardized way to specify debug information. It describes source
113language entities such as compilation units, functions, types, variables, etc.
114It is either embedded directly in sections of the code object executables, or
115split into separate files that they reference.
116
117DWARF maps between source program language entities and their hardware
118representations. For example:
119
120- It maps a hardware instruction program counter to a source language program
121  line, and vice versa.
122- It maps a source language function to the hardware instruction program counter
123  for its entry point.
124- It maps a source language variable to its hardware location when at a
125  particular program counter.
126- It provides information to allow virtual unwinding of hardware registers for a
127  source language function call stack.
128- In addition, it provides numerous other information about the source language
129  program.
130
131In particular, there is great diversity in the way a source language entity
132could be mapped to a hardware location. The location may involve runtime values.
133For example, a source language variable location could be:
134
135- In register.
136- At a memory address.
137- At an offset from the current stack pointer.
138- Optimized away, but with a known compiler time value.
139- Optimized away, but with an unknown value, such as happens for unused
140  variables.
141- Spread across combination of the above kinds of locations.
142- At a memory address, but also transiently loaded into registers.
143
144To support this DWARF 5 defines a rich expression language comprised of loclist
145expressions and operation expressions. Loclist expressions allow the result to
146vary depending on the PC. Operation expressions are made up of a list of
147operations that are evaluated on a simple stack machine.
148
149A DWARF expression can be used as the value of different attributes of different
150debug information entries (DIE). A DWARF expression can also be used as an
151argument to call frame information information (CFI) entry operations. An
152expression is evaluated in a context dictated by where it is used. The context
153may include:
154
155- Whether the expression needs to produce a value or the location of an entity.
156- The current execution point including process, thread, PC, and stack frame.
157- Some expressions are evaluated with the stack initialized with a specific
158  value or with the location of a base object that is available using the
159  DW_OP_push_object_address operation.
160
161## Examples
162
163The following examples illustrate how DWARF expressions involving operations are
164evaluated in DWARF 5. DWARF also has expressions involving location lists that
165are not covered in these examples.
166
167### Dynamic Array Size
168
169The first example is for an operation expression associated with a DIE attribute
170that provides the number of elements in a dynamic array type. Such an attribute
171dictates that the expression must be evaluated in the context of providing a
172value result kind.
173
174![Dynamic Array Size Example](images/01-value.example.png)
175
176In this hypothetical example, the compiler has allocated an array descriptor in
177memory and placed the descriptor's address in architecture register SGPR0. The
178first location of the array descriptor is the runtime size of the array.
179
180A possible expression to retrieve the dynamic size of the array is:
181
182    DW_OP_regval_type SGPR0 Generic
183    DW_OP_deref
184
185The expression is evaluated one operation at a time. Operations have operands
186and can pop and push entries on a stack.
187
188![Dynamic Array Size Example: Step 1](images/01-value.example.frame.1.png)
189
190The expression evaluation starts with the first DW_OP_regval_type operation.
191This operation reads the current value of an architecture register specified by
192its first operand: SGPR0. The second operand specifies the size of the data to
193read. The read value is pushed on the stack. Each stack element is a value and
194its associated type.
195
196![Dynamic Array Size Example: Step 2](images/01-value.example.frame.2.png)
197
198The type must be a DWARF base type. It specifies the encoding, byte ordering,
199and size of values of the type. DWARF defines that each architecture has a
200default generic type: it is an architecture specific integral encoding and byte
201ordering, that is the size of the architecture's global memory address.
202
203The DW_OP_deref operation pops a value off the stack, treats it as a global
204memory address, and reads the contents of that location using the generic type.
205It pushes the read value on the stack as the value and its associated generic
206type.
207
208![Dynamic Array Size Example: Step 3](images/01-value.example.frame.3.png)
209
210The evaluation stops when it reaches the end of the expression. The result of an
211expression that is evaluated with a value result kind context is the top element
212of the stack, which provides the value and its type.
213
214### Variable Location in Register
215
216This example is for an operation expression associated with a DIE attribute that
217provides the location of a source language variable. Such an attribute dictates
218that the expression must be evaluated in the context of providing a location
219result kind.
220
221DWARF defines the locations of objects in terms of location descriptions.
222
223In this example, the compiler has allocated a source language variable in
224architecture register SGPR0.
225
226![Variable Location in Register Example](images/02-reg.example.png)
227
228A possible expression to specify the location of the variable is:
229
230    DW_OP_regx SGPR0
231
232![Variable Location in Register Example: Step 1](images/02-reg.example.frame.1.png)
233
234The DW_OP_regx operation creates a location description that specifies the
235location of the architecture register specified by the operand: SGPR0. Unlike
236values, location descriptions are not pushed on the stack. Instead they are
237conceptually placed in a location area. Unlike values, location descriptions do
238not have an associated type, they only denote the location of the base of the
239object.
240
241![Variable Location in Register Example: Step 2](images/02-reg.example.frame.2.png)
242
243Again, evaluation stops when it reaches the end of the expression. The result of
244an expression that is evaluated with a location result kind context is the
245location description in the location area.
246
247### Variable Location in Memory
248
249The next example is for an operation expression associated with a DIE attribute
250that provides the location of a source language variable that is allocated in a
251stack frame. The compiler has placed the stack frame pointer in architecture
252register SGPR0, and allocated the variable at offset 0x10 from the stack frame
253base. The stack frames are allocated in global memory, so SGPR0 contains a
254global memory address.
255
256![Variable Location in Memory Example](images/03-memory.example.png)
257
258A possible expression to specify the location of the variable is:
259
260    DW_OP_regval_type SGPR0 Generic
261    DW_OP_plus_uconst 0x10
262
263![Variable Location in Memory Example: Step 1](images/03-memory.example.frame.1.png)
264
265As in the previous example, the DW_OP_regval_type operation pushes the stack
266frame pointer global memory address onto the stack. The generic type is the size
267of a global memory address.
268
269![Variable Location in Memory Example: Step 2](images/03-memory.example.frame.2.png)
270
271The DW_OP_plus_uconst operation pops a value from the stack, which must have a
272type with an integral encoding, adds the value of its operand, and pushes the
273result back on the stack with the same associated type. In this example, that
274computes the global memory address of the source language variable.
275
276![Variable Location in Memory Example: Step 3](images/03-memory.example.frame.3.png)
277
278Evaluation stops when it reaches the end of the expression. If the expression
279that is evaluated has a location result kind context, and the location area is
280empty, then the top stack element must be a value with the generic type. The
281value is implicitly popped from the stack, and treated as a global memory
282address to create a global memory location description, which is placed in the
283location area. The result of the expression is the location description in the
284location area.
285
286![Variable Location in Memory Example: Step 4](images/03-memory.example.frame.4.png)
287
288### Variable Spread Across Different Locations
289
290This example is for a source variable that is partly in a register, partly undefined, and partly in memory.
291
292![Variable Spread Across Different Locations Example](images/04-composite.example.png)
293
294DWARF defines composite location descriptions that can have one or more parts.
295Each part specifies a location description and the number of bytes used from it.
296The following operation expression creates a composite location description.
297
298    DW_OP_regx SGPR3
299    DW_OP_piece 4
300    DW_OP_piece 2
301    DW_OP_bregx SGPR0 0x10
302    DW_OP_piece 2
303
304![Variable Spread Across Different Locations Example: Step 1](images/04-composite.example.frame.1.png)
305
306The DW_OP_regx operation creates a register location description in the location
307area.
308
309![Variable Spread Across Different Locations Example: Step 2](images/04-composite.example.frame.2.png)
310
311The first DW_OP_piece operation creates an incomplete composite location
312description in the location area with a single part. The location description in
313the location area is used to define the beginning of the part for the size
314specified by the operand, namely 4 bytes.
315
316![Variable Spread Across Different Locations Example: Step 3](images/04-composite.example.frame.3.png)
317
318A subsequent DW_OP_piece adds a new part to an incomplete composite location
319description already in the location area. The parts form a contiguous set of
320bytes. If there are no other location descriptions in the location area, and no
321value on the stack, then the part implicitly uses the undefined location
322description. Again, the operand specifies the size of the part in bytes. The
323undefined location description can be used to indicate a part that has been
324optimized away. In this case, 2 bytes of undefined value.
325
326![Variable Spread Across Different Locations Example: Step 4](images/04-composite.example.frame.4.png)
327
328The DW_OP_bregx operation reads the architecture register specified by the first
329operand (SGPR0) as the generic type, adds the value of the second operand
330(0x10), and pushes the value on the stack.
331
332![Variable Spread Across Different Locations Example: Step 5](images/04-composite.example.frame.5.png)
333
334The next DW_OP_piece operation adds another part to the already created
335incomplete composite location.
336
337If there is no other location in the location area, but there is a value on
338stack, the new part is a memory location description. The memory address used is
339popped from the stack. In this case, the operand of 2 indicates there are 2
340bytes from memory.
341
342![Variable Spread Across Different Locations Example: Step 6](images/04-composite.example.frame.6.png)
343
344Evaluation stops when it reaches the end of the expression. If the expression
345that is evaluated has a location result kind context, and the location area has
346an incomplete composite location description, the incomplete composite location
347is implicitly converted to a complete composite location description. The result
348of the expression is the location description in the location area.
349
350![Variable Spread Across Different Locations Example: Step 7](images/04-composite.example.frame.7.png)
351
352### Offsetting a Composite Location
353
354This example attempts to extend the previous example to offset the composite
355location description it created. The *Variable Location in Memory* example
356conveniently used the DW_OP_plus operation to offset a memory address.
357
358    DW_OP_regx SGPR3
359    DW_OP_piece 4
360    DW_OP_piece 2
361    DW_OP_bregx SGPR0 0x10
362    DW_OP_piece 2
363    DW_OP_plus_uconst 5
364
365![Offsetting a Composite Location Example: Step 6](images/05-composite-plus.example.frame.1.png)
366
367However, DW_OP_plus cannot be used to offset a composite location. It only
368operates on the stack.
369
370![Offsetting a Composite Location Example: Step 7](images/05-composite-plus.example.frame.2.png)
371
372To offset a composite location description, the compiler would need to make a
373different composite location description, starting at the part corresponding to
374the offset. For example:
375
376    DW_OP_piece 1
377    DW_OP_bregx SGPR0 0x10
378    DW_OP_piece 2
379
380This illustrates that operations on stack values are not composable with
381operations on location descriptions.
382
383## Limitations
384
385DWARF 5 is unable to describe variables in runtime indexed parts of registers.
386This is required to describe a source variable that is located in a lane of a
387SIMT vector register.
388
389Some features only work when located in global memory. The type attribute
390expressions require a base object which could be in any kind of location.
391
392DWARF procedures can only accept global memory address arguments. This limits
393the ability to factorize the creation of locations that involve other location
394kinds.
395
396There are no vector base types. This is required to describe vector registers.
397
398There is no operation to create a memory location in a non-global address space.
399Only the dereference operation supports providing an address space.
400
401CFI location expressions do not allow composite locations or non-global address
402space memory locations. Both these are needed in optimized code for devices with
403vector registers and address spaces.
404
405Bit field offsets are only supported in a limited way for register locations.
406Supporting them in a uniform manner for all location kinds is required to
407support languages with bit sized entities.
408
409# Extension Solution
410
411This section outlines the extension to generalize the DWARF expression evaluation
412model to allow location descriptions to be manipulated on the stack. It presents
413a number of simplified examples to demonstrate the benefits and how the extension
414solves the issues of heterogeneous devices. It presents how this is done in
415a manner that is backwards compatible with DWARF 5.
416
417## Location Description
418
419In order to have consistent, composable operations that act on location
420descriptions, the extension defines a uniform way to handle all location kinds.
421That includes memory, register, implicit, implicit pointer, undefined, and
422composite location descriptions.
423
424Each kind of location description is conceptually a zero-based offset within a
425piece of storage. The storage is a contiguous linear organization of a certain
426number of bytes (see below for how this is extended to support bit sized
427storage).
428
429- For global memory, the storage is the linear stream of bytes of the
430  architecture's address size.
431- For each separate architecture register, it is the linear stream of bytes of
432  the size of that specific register.
433- For an implicit, it is the linear stream of bytes of the value when
434  represented using the value's base type which specifies the encoding, size,
435  and byte ordering.
436- For undefined, it is an infinitely sized linear stream where every byte is
437  undefined.
438- For composite, it is a linear stream of bytes defined by the composite's parts.
439
440## Stack Location Description Operations
441
442The DWARF expression stack is extended to allow each stack entry to either be a
443value or a location description.
444
445Evaluation rules are defined to implicitly convert a stack element that is a
446value to a location description, or vice versa, so that all DWARF 5 expressions
447continue to have the same semantics. This reflects that a memory address is
448effectively used as a proxy for a memory location description.
449
450For each place that allows a DWARF expression to be specified, it is defined if
451the expression is to be evaluated as a value or a location description.
452
453Existing DWARF expression operations that are used to act on memory addresses
454are generalized to act on any location description kind. For example, the
455DW_OP_deref operation pops a location description rather than a memory address
456value from the stack and reads the storage associated with the location kind
457starting at the location description's offset.
458
459Existing DWARF expression operations that create location descriptions are
460changed to pop and push location descriptions on the stack. For example, the
461DW_OP_value, DW_OP_regx, DW_OP_implicit_value, DW_OP_implicit_pointer,
462DW_OP_stack_value, and DW_OP_piece.
463
464New operations that act on location descriptions can be added. For example, a
465DW_OP_offset operation that modifies the offset of the location description on
466top of the stack. Unlike the DW_OP_plus operation that only works with memory
467address, a DW_OP_offset operation can work with any location kind.
468
469To allow incremental and nested creation of composite location descriptions, a
470DW_OP_piece_end can be defined to explicitly indicate the last part of a
471composite. Currently, creating a composite must always be the last operation of
472an expression.
473
474A DW_OP_undefined operation can be defined that explicitly creates the undefined
475location description. Currently this is only possible as a piece of a composite
476when the stack is empty.
477
478## Examples
479
480This section provides some motivating examples to illustrate the benefits that
481result from allowing location descriptions on the stack.
482
483### Source Language Variable Spilled to Part of a Vector Register
484
485A compiler generating code for a GPU may allocate a source language variable
486that it proves has the same value for every lane of a SIMT thread in a scalar
487register. It may then need to spill that scalar register. To avoid the high cost
488of spilling to memory, it may spill to a fixed lane of one of the numerous
489vector registers.
490
491![Source Language Variable Spilled to Part of a Vector Register Example](images/06-extension-spill-sgpr-to-static-vpgr-lane.example.png)
492
493The following expression defines the location of a source language variable that
494the compiler allocated in a scalar register, but had to spill to lane 5 of a
495vector register at this point of the code.
496
497    DW_OP_regx VGPR0
498    DW_OP_offset_uconst 20
499
500![Source Language Variable Spilled to Part of a Vector Register Example: Step 1](images/06-extension-spill-sgpr-to-static-vpgr-lane.example.frame.1.png)
501
502The DW_OP_regx pushes a register location description on the stack. The storage
503for the register is the size of the vector register. The register location
504description conceptually references that storage with an initial offset of 0.
505The architecture defines the byte ordering of the register.
506
507![Source Language Variable Spilled to Part of a Vector Register Example: Step 2](images/06-extension-spill-sgpr-to-static-vpgr-lane.example.frame.2.png)
508
509The DW_OP_offset_uconst pops a location description off the stack, adds its
510operand value to the offset, and pushes the updated location description back on
511the stack. In this case the source language variable is being spilled to lane 5
512and each lane's component which is 32-bits (4 bytes), so the offset is 5*4=20.
513
514![Source Language Variable Spilled to Part of a Vector Register Example: Step 3](images/06-extension-spill-sgpr-to-static-vpgr-lane.example.frame.3.png)
515
516The result of the expression evaluation is the location description on the top
517of the stack.
518
519An alternative approach could be for the target to define distinct register
520names for each part of each vector register. However, this is not practical for
521GPUs due to the sheer number of registers that would have to be defined. It
522would also not permit a runtime index into part of the whole register to be used
523as shown in the next example.
524
525### Source Language Variable Spread Across Multiple Vector Registers
526
527A compiler may generate SIMT code for a GPU. Each source language thread of
528execution is mapped to a single lane of the GPU thread. Source language
529variables that are mapped to a register, are mapped to the lane component of the
530vector registers corresponding to the source language's thread of execution.
531
532The location expression for such variables must therefore be executed in the
533context of the focused source language thread of execution. A DW_OP_push_lane
534operation can be defined to push the value of the lane for the currently focused
535source language thread of execution. The value to use would be provided by the
536consumer of DWARF when it evaluates the location expression.
537
538If the source language variable is larger than the size of the vector register
539lane component, then multiple vector registers are used. Each source language
540thread of execution will only use the vector register components for its
541associated lane.
542
543![Source Language Variable Spread Across Multiple Vector Registers Example](images/07-extension-multi-lane-vgpr.example.png)
544
545The following expression defines the location of a source language variable that
546has to occupy two vector registers. A composite location description is created
547that combines the two parts. It will give the correct result regardless of which
548lane corresponds to the source language thread of execution that the user is
549focused on.
550
551    DW_OP_regx VGPR0
552    DW_OP_push_lane
553    DW_OP_uconst 4
554    DW_OP_mul
555    DW_OP_offset
556    DW_OP_piece 4
557    DW_OP_regx VGPR1
558    DW_OP_push_lane
559    DW_OP_uconst 4
560    DW_OP_mul
561    DW_OP_offset
562    DW_OP_piece 4
563
564![Source Language Variable Spread Across Multiple Vector Registers Example: Step 1](images/07-extension-multi-lane-vgpr.example.frame.1.png)
565
566The DW_OP_regx VGPR0 pushes a location description for the first register.
567
568![Source Language Variable Spread Across Multiple Vector Registers Example: Step 2](images/07-extension-multi-lane-vgpr.example.frame.2.png)
569
570The DW_OP_push_lane; DW_OP_uconst 4; DW_OP_mul calculates the offset for the
571focused lanes vector register component as 4 times the lane number.
572
573![Source Language Variable Spread Across Multiple Vector Registers Example: Step 3](images/07-extension-multi-lane-vgpr.example.frame.3.png)
574
575![Source Language Variable Spread Across Multiple Vector Registers Example: Step 4](images/07-extension-multi-lane-vgpr.example.frame.4.png)
576
577![Source Language Variable Spread Across Multiple Vector Registers Example: Step 5](images/07-extension-multi-lane-vgpr.example.frame.5.png)
578
579The DW_OP_offset adjusts the register location description's offset to the
580runtime computed value.
581
582![Source Language Variable Spread Across Multiple Vector Registers Example: Step 6](images/07-extension-multi-lane-vgpr.example.frame.6.png)
583
584The DW_OP_piece either creates a new composite location description, or adds a
585new part to an existing incomplete one. It pops the location description to use
586for the new part. It then pops the next stack element if it is an incomplete
587composite location description, otherwise it creates a new incomplete composite
588location description with no parts. Finally it pushes the incomplete composite
589after adding the new part.
590
591In this case a register location description is added to a new incomplete
592composite location description. The 4 of the DW_OP_piece specifies the size of
593the register storage that comprises the part. Note that the 4 bytes start at the
594computed register offset.
595
596For backwards compatibility, if the stack is empty or the top stack element is
597an incomplete composite, an undefined location description is used for the part.
598If the top stack element is a generic base type value, then it is implicitly
599converted to a global memory location description with an offset equal to the
600value.
601
602![Source Language Variable Spread Across Multiple Vector Registers Example: Step 7](images/07-extension-multi-lane-vgpr.example.frame.7.png)
603
604The rest of the expression does the same for VGPR1. However, when the
605DW_OP_piece is evaluated there is an incomplete composite on the stack. So the
606VGPR1 register location description is added as a second part.
607
608![Source Language Variable Spread Across Multiple Vector Registers Example: Step 8](images/07-extension-multi-lane-vgpr.example.frame.8.png)
609
610![Source Language Variable Spread Across Multiple Vector Registers Example: Step 9](images/07-extension-multi-lane-vgpr.example.frame.9.png)
611
612![Source Language Variable Spread Across Multiple Vector Registers Example: Step 10](images/07-extension-multi-lane-vgpr.example.frame.10.png)
613
614![Source Language Variable Spread Across Multiple Vector Registers Example: Step 11](images/07-extension-multi-lane-vgpr.example.frame.11.png)
615
616![Source Language Variable Spread Across Multiple Vector Registers Example: Step 12](images/07-extension-multi-lane-vgpr.example.frame.12.png)
617
618![Source Language Variable Spread Across Multiple Vector Registers Example: Step 13](images/07-extension-multi-lane-vgpr.example.frame.13.png)
619
620At the end of the expression, if the top stack element is an incomplete
621composite location description, it is converted to a complete location
622description and returned as the result.
623
624![Source Language Variable Spread Across Multiple Vector Registers Example: Step 14](images/07-extension-multi-lane-vgpr.example.frame.14.png)
625
626### Source Language Variable Spread Across Multiple Kinds of Locations
627
628This example is the same as the previous one, except the first 2 bytes of the
629second vector register have been spilled to memory, and the last 2 bytes have
630been proven to be a constant and optimized away.
631
632![Source Language Variable Spread Across Multiple Kinds of Locations Example](images/08-extension-mixed-composite.example.png)
633
634    DW_OP_regx VGPR0
635    DW_OP_push_lane
636    DW_OP_uconst 4
637    DW_OP_mul
638    DW_OP_offset
639    DW_OP_piece 4
640    DW_OP_addr 0xbeef
641    DW_OP_piece 2
642    DW_OP_uconst 0xf00d
643    DW_OP_stack_value
644    DW_OP_piece 2
645    DW_OP_piece_end
646
647The first 6 operations are the same.
648
649![Source Language Variable Spread Across Multiple Kinds of Locations Example: Step 7](images/08-extension-mixed-composite.example.frame.1.png)
650
651The DW_OP_addr operation pushes a global memory location description on the
652stack with an offset equal to the address.
653
654![Source Language Variable Spread Across Multiple Kinds of Locations Example: Step 8](images/08-extension-mixed-composite.example.frame.2.png)
655
656The next DW_OP_piece adds the global memory location description as the next 2
657byte part of the composite.
658
659![Source Language Variable Spread Across Multiple Kinds of Locations Example: Step 9](images/08-extension-mixed-composite.example.frame.3.png)
660
661The DW_OP_uconst 0xf00d; DW_OP_stack_value pushes an implicit location
662description on the stack. The storage of the implicit location description is
663the representation of the value 0xf00d using the generic base type's encoding,
664size, and byte ordering.
665
666![Source Language Variable Spread Across Multiple Kinds of Locations Example: Step 10](images/08-extension-mixed-composite.example.frame.4.png)
667
668![Source Language Variable Spread Across Multiple Kinds of Locations Example: Step 11](images/08-extension-mixed-composite.example.frame.5.png)
669
670The final DW_OP_piece adds 2 bytes of the implicit location description as the
671third part of the composite location description.
672
673![Source Language Variable Spread Across Multiple Kinds of Locations Example: Step 12](images/08-extension-mixed-composite.example.frame.6.png)
674
675The DW_OP_piece_end operation explicitly makes the incomplete composite location
676description into a complete location description. This allows a complete
677composite location description to be created on the stack that can be used as
678the location description of another following operation. For example, the
679DW_OP_offset can be applied to it. More practically, it permits creation of
680multiple composite location descriptions on the stack which can be used to pass
681arguments to a DWARF procedure using a DW_OP_call* operation. This can be
682beneficial to factor the incrementally creation of location descriptions.
683
684![Source Language Variable Spread Across Multiple Kinds of Locations Example: Step 12](images/08-extension-mixed-composite.example.frame.7.png)
685
686### Address Spaces
687
688Heterogeneous devices can have multiple hardware supported address spaces which
689use specific hardware instructions to access them.
690
691For example, GPUs that use SIMT execution may provide hardware support to access
692memory such that each lane can see a linear memory view, while the backing
693memory is actually being accessed in an interleaved manner so that the locations
694for each lanes Nth dword are contiguous. This minimizes cache lines read by the
695SIMT execution.
696
697![Address Spaces Example](images/09-extension-form-aspace.example.png)
698
699The following expression defines the location of a source language variable that
700is allocated at offset 0x10 in the current subprograms stack frame. The
701subprogram stack frames are per lane and reside in an interleaved address space.
702
703    DW_OP_regval_type SGPR0 Generic
704    DW_OP_uconst 1
705    DW_OP_form_aspace_address
706    DW_OP_offset 0x10
707
708![Address Spaces Example: Step 1](images/09-extension-form-aspace.example.frame.1.png)
709
710The DW_OP_regval_type operation pushes the contents of SGPR0 as a generic value.
711This is the register that holds the address of the current stack frame.
712
713![Address Spaces Example: Step 2](images/09-extension-form-aspace.example.frame.2.png)
714
715The DW_OP_uconst operation pushes the address space number. Each architecture
716defines the numbers it uses in DWARF. In this case, address space 1 is being
717used as the per lane memory.
718
719![Address Spaces Example: Step 3](images/09-extension-form-aspace.example.frame.3.png)
720
721The DW_OP_form_aspace_address operation pops a value and an address space
722number. Each address space is associated with a separate storage. A memory
723location description is pushed which refers to the address space's storage, with
724an offset of the popped value.
725
726![Address Spaces Example: Step 4](images/09-extension-form-aspace.example.frame.4.png)
727
728All operations that act on location descriptions work with memory locations
729regardless of their address space.
730
731Every architecture defines address space 0 as the default global memory address
732space.
733
734Generalizing memory location descriptions to include an address space component
735avoids having to create specialized operations to work with address spaces.
736
737The source variable is at offset 0x10 in the stack frame. The DW_OP_offset
738operation works on memory location descriptions that have an address space just
739like for any other kind of location description.
740
741![Address Spaces Example: Step 5](images/09-extension-form-aspace.example.frame.5.png)
742
743The only operations in DWARF 5 that take an address space are DW_OP_xderef*.
744They treat a value as the address in a specified address space, and read its
745contents. There is no operation to actually create a location description that
746references an address space. There is no way to include address space memory
747locations in parts of composite locations.
748
749Since DW_OP_piece now takes any kind of location description for its pieces, it
750is now possible for parts of a composite to involve locations in different
751address spaces. For example, this can happen when parts of a source variable
752allocated in a register are spilled to a stack frame that resides in the
753non-global address space.
754
755### Bit Offsets
756
757With the generalization of location descriptions on the stack, it is possible to
758define a DW_OP_bit_offset operation that adjusts the offset of any kind of
759location in terms of bits rather than bytes. The offset can be a runtime
760computed value. This is generally useful for any source language that support
761bit sized entities, and for registers that are not a whole number of bytes.
762
763DWARF 5 only supports bit fields in composites using DW_OP_bit_piece. It does
764not support runtime computed offsets which can happen for bit field packed
765arrays. It is also not generally composable as it must be the last part of an
766expression.
767
768The following example defines a location description for a source variable that
769is allocated starting at bit 20 of a register. A similar expression could be
770used if the source variable was at a bit offset within memory or a particular
771address space, or if the offset is a runtime value.
772
773![Bit Offsets Example](images/10-extension-bit-offset.example.png)
774
775    DW_OP_regx SGPR3
776    DW_OP_uconst 20
777    DW_OP_bit_offset
778
779![Bit Offsets Example: Step 1](images/10-extension-bit-offset.example.frame.1.png)
780
781![Bit Offsets Example: Step 2](images/10-extension-bit-offset.example.frame.2.png)
782
783![Bit Offsets Example: Step 3](images/10-extension-bit-offset.example.frame.3.png)
784
785The DW_OP_bit_offset operation pops a value and location description from the
786stack. It pushes the location description after updating its offset using the
787value as a bit count.
788
789![Bit Offsets Example: Step 4](images/10-extension-bit-offset.example.frame.4.png)
790
791The ordering of bits within a byte, like byte ordering, is defined by the target
792architecture. A base type could be extended to specify bit ordering in addition
793to byte ordering.
794
795## Call Frame Information (CFI)
796
797DWARF defines call frame information (CFI) that can be used to virtually unwind
798the subprogram call stack. This involves determining the location where register
799values have been spilled. DWARF 5 limits these locations to either be registers
800or global memory. As shown in the earlier examples, heterogeneous devices may
801spill registers to parts of other registers, to non-global memory address
802spaces, or even a composite of different location kinds.
803
804Therefore, the extension extends the CFI rules to support any kind of location
805description, and operations to create locations in address spaces.
806
807## Objects Not In Byte Aligned Global Memory
808
809DWARF 5 only effectively supports byte aligned memory locations on the stack by
810using a global memory address as a proxy for a memory location description. This
811is a problem for attributes that define DWARF expressions that require the
812location of some source language entity that is not allocated in byte aligned
813global memory.
814
815For example, the DWARF expression of the DW_AT_data_member_location attribute is
816evaluated with an initial stack containing the location of a type instance
817object. That object could be located in a register, in a non-global memory
818address space, be described by a composite location description, or could even
819be an implicit location description.
820
821A similar problem exists for DWARF expressions that use the
822DW_OP_push_object_address operation. This operation pushes the location of a
823program object associated with the attribute that defines the expression.
824
825Allowing any kind of location description on the stack permits the DW_OP_call*
826operations to be used to factor the creation of location descriptions. The
827inputs and outputs of the call are passed on the stack. For example, on GPUs an
828expression can be defined to describe the effective PC of inactive lanes of SIMT
829execution. This is naturally done by composing the result of expressions for
830each nested control flow region. This can be done by making each control flow
831region have its own DWARF procedure, and then calling it from the expressions of
832the nested control flow regions. The alternative is to make each control flow
833region have the complete expression which results in much larger DWARF and is
834less convenient to generate.
835
836GPU compilers work hard to allocate objects in the larger number of registers to
837reduce memory accesses, they have to use different memory address spaces, and
838they perform optimizations that result in composites of these. Allowing
839operations to work with any kind of location description enables creating
840expressions that support all of these.
841
842Full general support for bit fields and implicit locations benefits
843optimizations on any target.
844
845## Higher Order Operations
846
847The generalization allows an elegant way to add higher order operations that
848create location descriptions out of other location descriptions in a general
849composable manner.
850
851For example, a DW_OP_extend operation could create a composite location
852description out of a location description, an element size, and an element
853count. The resulting composite would effectively be a vector of element count
854elements with each element being the same location description of the specified
855bit size.
856
857A DW_OP_select_bit_piece operation could create a composite location description
858out of two location descriptions, a bit mask value, and an element size. The
859resulting composite would effectively be a vector of elements, selecting from
860one of the two input locations according to the bit mask.
861
862These could be used in the expression of an attribute that computes the
863effective PC of lanes of SIMT execution. The vector result efficiently computes
864the PC for each SIMT lane at once. The mask could be the hardware execution mask
865register that controls which SIMT lanes are executing. For active divergent
866lanes the vector element would be the current PC, and for inactive divergent
867lanes the PC would correspond to the source language line at which the lane is
868logically positioned.
869
870Similarly, a DW_OP_overlay_piece operation could be defined that creates a
871composite location description out of two location descriptions, an offset
872value, and a size. The resulting composite would consist of parts that are
873equivalent to one of the location descriptions, but with the other location
874description replacing a slice defined by the offset and size. This could be used
875to efficiently express a source language array that has had a set of elements
876promoted into a vector register when executing a set of iterations of a loop in
877a SIMD manner.
878
879## Objects In Multiple Places
880
881A compiler may allocate a source variable in stack frame memory, but for some
882range of code may promote it to a register. If the generated code does not
883change the register value, then there is no need to save it back to memory.
884Effectively, during that range, the source variable is in both memory and a
885register. If a consumer, such as a debugger, allows the user to change the value
886of the source variable in that PC range, then it would need to change both
887places.
888
889DWARF 5 supports loclists which are able to specify the location of a source
890language entity is in different places at different PC locations. It can also
891express that a source language entity is in multiple places at the same time.
892
893DWARF 5 defines operation expressions and loclists separately. In general, this
894is adequate as non-memory location descriptions can only be computed as the last
895step of an expression evaluation.
896
897However, allowing location descriptions on the stack permits non-memory location
898descriptions to be used in the middle of expression evaluation. For example, the
899DW_OP_call* and DW_OP_implicit_pointer operations can result in evaluating the
900expression of a DW_AT_location attribute of a DIE. The DW_AT_location attribute
901allows the loclist form. So the result could include multiple location
902descriptions.
903
904Similarly, the DWARF expression associated with attributes such as
905DW_AT_data_member_location that are evaluated with an initial stack containing a
906location description, or a DWARF operation expression that uses the
907DW_OP_push_object_address operation, may want to act on the result of another
908expression that returned a location description involving multiple places.
909
910Therefore, the extension needs to define how expression operations that use those
911results will behave. The extension does this by generalizing the expression stack
912to allow an entry to be one or more single location descriptions. In doing this,
913it unifies the definitions of DWARF operation expressions and loclist
914expressions in a natural way.
915
916All operations that act on location descriptions are extended to act on multiple
917single location descriptions. For example, the DW_OP_offset operation adds the
918offset to each single location description. The DW_OP_deref* operations simply
919read the storage of one of the single location descriptions, since multiple
920single location descriptions must all hold the same value. Similarly, if the
921evaluation of a DWARF expression results in multiple single location
922descriptions, the consumer can ensure any updates are done to all of them, and
923any reads can use any one of them.
924
925# Conclusion
926
927A strength of DWARF is that it has generally sought to provide generalized
928composable solutions that address many problems, rather than solutions that only
929address one-off issues. This extension attempts to follow that tradition by
930defining a backwards compatible composable generalization that can address a
931significant family of issues. It addresses the specific issues present for
932heterogeneous computing devices, provides benefits for non-heterogeneous
933devices, and can help address a number of other previously reported issues.
934
935# Further Information
936
937The following references provide additional information on the extension.
938
939Slides and a video of a presentation at the Linux Plumbers Conference 2021
940related to this extension are available.
941
942The LLVM compiler extension includes possible normative text changes for this
943extension as well as the operations mentioned in the motivating examples. It
944also covers other extensions needed for heterogeneous devices.
945
946- DWARF extensions for optimized SIMT/SIMD (GPU) debugging - Linux Plumbers Conference 2021
947  - [Video](https://www.youtube.com/watch?v=QiR0ra0ymEY&t=10015s)
948  - [Slides](https://linuxplumbersconf.org/event/11/contributions/1012/attachments/798/1505/DWARF_Extensions_for_Optimized_SIMT-SIMD_GPU_Debugging-LPC2021.pdf)
949- [DWARF Extensions For Heterogeneous Debugging](https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html)
950