# Buffer Deallocation - Internals

This section covers the internal functionality of the BufferDeallocation
transformation. The transformation consists of several passes. The main pass
called BufferDeallocation can be applied via “-buffer-deallocation” on MLIR
programs.

## Requirements

In order to use BufferDeallocation on an arbitrary dialect, several control-flow
interfaces have to be implemented by custom operations. This is particularly
important for understanding the implicit control-flow dependencies between
different parts of the input program. Without implementing the following
interfaces, control-flow relations cannot be discovered properly and the
resulting program can become invalid:

*   Branch-like terminators should implement the `BranchOpInterface` to query
    and manipulate associated operands.
*   Operations involving structured control flow have to implement the
    `RegionBranchOpInterface` to model inter-region control flow.
*   Terminators yielding values to their parent operation (in particular in the
    scope of nested regions within `RegionBranchOpInterface`-based operations)
    should implement the `ReturnLike` trait to represent logical “value
    returns”.

The “std” and “scf” dialects are examples of dialects that implement all of
these interfaces and are therefore fully compatible with BufferDeallocation.
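
The following sketch (not taken from the transformation itself, but built from
the same “cf” and “scf” operations used in the examples throughout this
document) illustrates where these interfaces come into play: `cf.cond_br` and
`cf.br` implement the `BranchOpInterface`, `scf.if` implements the
`RegionBranchOpInterface`, and its `scf.yield` terminators carry the
`ReturnLike` trait:

```mlir
func.func @interfaces(%arg0: i1, %arg1: memref<2xf32>, %arg2: memref<2xf32>) {
  // cf.cond_br / cf.br implement the BranchOpInterface.
  cf.cond_br %arg0, ^bb1, ^bb2
^bb1:
  cf.br ^bb3(%arg1 : memref<2xf32>)
^bb2:
  // scf.if implements the RegionBranchOpInterface; the scf.yield terminators
  // are ReturnLike and “return” a value to the parent operation.
  %0 = scf.if %arg0 -> (memref<2xf32>) {
    scf.yield %arg1 : memref<2xf32>
  } else {
    scf.yield %arg2 : memref<2xf32>
  }
  cf.br ^bb3(%0 : memref<2xf32>)
^bb3(%1: memref<2xf32>):
  test.copy(%1, %arg2) : (memref<2xf32>, memref<2xf32>) -> ()
  return
}
```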

During Bufferization, we convert immutable value types (tensors) to mutable
types (memref). This conversion is done in several steps and in all of these
steps the IR has to fulfill SSA-like properties. The usage of memref has to
follow a consecutive order: allocate the buffer, write to the buffer, read from
the buffer. Only buffer reads are allowed after the initial, full buffer write
is done. In particular, there must be no partial write to a buffer after the
initial write has been finished. However, partial writes during initialization
are allowed (e.g. filling the buffer step by step in a loop). This means that
all buffer writes need to dominate all buffer reads.

Example for breaking the invariant:

```mlir
func.func @condBranch(%arg0: i1, %arg1: memref<2xf32>) {
  %0 = memref.alloc() : memref<2xf32>
  cf.cond_br %arg0, ^bb1, ^bb2
^bb1:
  cf.br ^bb3()
^bb2:
  partial_write(%0, %0)
  cf.br ^bb3()
^bb3():
  test.copy(%0, %arg1) : (memref<2xf32>, memref<2xf32>) -> ()
  return
}
```

The maintenance of the SSA-like properties is only needed during the
bufferization process. Afterwards, for example during optimization passes, the
property is no longer required.
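
For contrast, the following sketch (not generated by any pass, shown only to
illustrate the invariant) respects the required order: the buffer is filled
step by step inside a loop before any read happens, so all writes dominate all
reads:

```mlir
func.func @validOrder(%arg0: memref<2xf32>) {
  %0 = memref.alloc() : memref<2xf32>
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %c2 = arith.constant 2 : index
  %cst = arith.constant 0.0 : f32
  // Initializing partial writes are allowed: the buffer is filled step by step.
  scf.for %i = %c0 to %c2 step %c1 {
    memref.store %cst, %0[%i] : memref<2xf32>
  }
  // Reads happen only after the initializing writes have finished.
  test.copy(%0, %arg0) : (memref<2xf32>, memref<2xf32>) -> ()
  return
}
```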

## Detection of Buffer Allocations

The first step of the BufferDeallocation transformation is to identify
manageable allocation operations that implement the `SideEffects` interface.
Furthermore, these ops need to apply the effect `MemoryEffects::Allocate` to a
particular result value while not using the resource
`SideEffects::AutomaticAllocationScopeResource` (since it is currently reserved
for allocations, like `Alloca`, that will be automatically deallocated by a
parent scope). Allocations that have not been detected in this phase will not be
tracked internally and will thus not be deallocated automatically. However,
BufferDeallocation is fully compatible with “hybrid” setups in which tracked and
untracked allocations are mixed:

```mlir
func.func @mixedAllocation(%arg0: i1) {
  %0 = memref.alloca() : memref<2xf32>  // aliases: %2
  %1 = memref.alloc() : memref<2xf32>   // aliases: %2
  cf.cond_br %arg0, ^bb1, ^bb2
^bb1:
  use(%0)
  cf.br ^bb3(%0 : memref<2xf32>)
^bb2:
  use(%1)
  cf.br ^bb3(%1 : memref<2xf32>)
^bb3(%2: memref<2xf32>):
  ...
}
```

Example of using a conditional branch with alloc and alloca. BufferDeallocation
can detect and handle the different allocation types that might be intermixed.

Note: the current version does not support allocation operations returning
multiple result buffers.

## Conversion from AllocOp to AllocaOp

The PromoteBuffersToStack pass converts AllocOps to AllocaOps, if possible. In
some cases, it can be useful to use such stack-based buffers instead of
heap-based buffers. The conversion is subject to several constraints:

*   Control flow
*   Buffer Size
*   Dynamic Size

If a buffer escapes its block, we are not allowed to convert it into an alloca.
If the size of the buffer is large, we could convert it, but to avoid stack
overflows it makes sense to limit the size of these buffers and only convert
small ones. The size limit can be set via a pass option; the current default
value is 1KB. Furthermore, we cannot convert buffers with dynamic size, since
the dimension is not known a priori.
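
A small sketch of the conversion (assuming the buffer is statically shaped,
below the size limit, and never escapes its block; `use` is a placeholder
operation as in the other examples):

```mlir
func.func @smallLocalBuffer() {
  // Before the pass: %0 = memref.alloc() : memref<2xf32>
  // The buffer never leaves this block, has a static shape and a small size,
  // so it can be promoted to a stack allocation:
  %0 = memref.alloca() : memref<2xf32>
  use(%0)
  return
}
```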

## Movement and Placement of Allocations

Using the buffer hoisting pass, all buffer allocations are moved as far upwards
as possible in order to group them and make upcoming optimizations easier by
limiting the search space. Such a movement is shown in the following graphs. In
addition, we are able to statically free an alloc if we move it into a dominator
of all of its uses. This simplifies further optimizations (e.g. buffer fusion)
in the future. However, the movement of allocations is limited by external data
dependencies (in particular in the case of allocations of dynamically shaped
types). Furthermore, allocations can be moved out of nested regions, if
necessary. In order to move allocations to valid locations with respect to their
uses only, we leverage Liveness information.

The following code snippet shows a conditional branch before running the
BufferHoisting pass:

![branch_example_pre_move](/includes/img/branch_example_pre_move.svg)

```mlir
func.func @condBranch(%arg0: i1, %arg1: memref<2xf32>, %arg2: memref<2xf32>) {
  cf.cond_br %arg0, ^bb1, ^bb2
^bb1:
  cf.br ^bb3(%arg1 : memref<2xf32>)
^bb2:
  %0 = memref.alloc() : memref<2xf32>  // aliases: %1
  use(%0)
  cf.br ^bb3(%0 : memref<2xf32>)
^bb3(%1: memref<2xf32>):  // %1 could be %0 or %arg1
  test.copy(%1, %arg2) : (memref<2xf32>, memref<2xf32>) -> ()
  return
}
```

Applying the BufferHoisting pass on this program results in the following piece
of code:

![branch_example_post_move](/includes/img/branch_example_post_move.svg)

```mlir
func.func @condBranch(%arg0: i1, %arg1: memref<2xf32>, %arg2: memref<2xf32>) {
  %0 = memref.alloc() : memref<2xf32>  // moved to bb0
  cf.cond_br %arg0, ^bb1, ^bb2
^bb1:
  cf.br ^bb3(%arg1 : memref<2xf32>)
^bb2:
  use(%0)
  cf.br ^bb3(%0 : memref<2xf32>)
^bb3(%1: memref<2xf32>):
  test.copy(%1, %arg2) : (memref<2xf32>, memref<2xf32>) -> ()
  return
}
```

The alloc is moved from bb2 to the beginning of the function and passed as an
argument to bb3.

The following example demonstrates an allocation using dynamically shaped types.
Due to the data dependency of the allocation on %0, we cannot move the
allocation out of bb2 in this case:

```mlir
func.func @condBranchDynamicType(
  %arg0: i1,
  %arg1: memref<?xf32>,
  %arg2: memref<?xf32>,
  %arg3: index) {
  cf.cond_br %arg0, ^bb1, ^bb2(%arg3: index)
^bb1:
  cf.br ^bb3(%arg1 : memref<?xf32>)
^bb2(%0: index):
  %1 = memref.alloc(%0) : memref<?xf32>  // cannot be moved upwards due to the
                                         // data dependency on %0
  use(%1)
  cf.br ^bb3(%1 : memref<?xf32>)
^bb3(%2: memref<?xf32>):
  test.copy(%2, %arg2) : (memref<?xf32>, memref<?xf32>) -> ()
  return
}
```

## Introduction of Clones

In order to guarantee that all allocated buffers are freed properly, we have to
pay attention to the control flow and all potential aliases a buffer allocation
can have. Since not all allocations can be safely freed with respect to their
aliases (see the following code snippet), it is often required to introduce
copies to eliminate them. Consider the following example in which the
allocations have already been placed:

```mlir
func.func @branch(%arg0: i1) {
  %0 = memref.alloc() : memref<2xf32>  // aliases: %2
  cf.cond_br %arg0, ^bb1, ^bb2
^bb1:
  %1 = memref.alloc() : memref<2xf32>  // resides here for demonstration purposes
                                       // aliases: %2
  cf.br ^bb3(%1 : memref<2xf32>)
^bb2:
  use(%0)
  cf.br ^bb3(%0 : memref<2xf32>)
^bb3(%2: memref<2xf32>):
  ...
  return
}
```

The first alloc can be safely freed in the block that post-dominates all of its
uses (bb3). The alloc in bb1 has an alias %2 in bb3 that also keeps this buffer
alive until the end of bb3. Since we cannot determine the actual branches that
will be taken at runtime, we have to ensure that all buffers are freed correctly
in bb3 regardless of the branches we will take to reach the exit block. This
makes it necessary to introduce a copy for %2, which allows us to free %1 in bb1
and %0 in bb3. Afterwards, we can continue processing all aliases of %2 (none in
this case) and we can safely free %2 at the end of the sample program. This
sample demonstrates that not all allocations can be safely freed in their
associated post-dominator blocks. Instead, we have to pay attention to all of
their aliases.

Applying the BufferDeallocation pass to the program above yields the following
result:

```mlir
func.func @branch(%arg0: i1) {
  %0 = memref.alloc() : memref<2xf32>
  cf.cond_br %arg0, ^bb1, ^bb2
^bb1:
  %1 = memref.alloc() : memref<2xf32>
  %3 = bufferization.clone %1 : (memref<2xf32>) -> (memref<2xf32>)
  memref.dealloc %1 : memref<2xf32> // %1 can be safely freed here
  cf.br ^bb3(%3 : memref<2xf32>)
^bb2:
  use(%0)
  %4 = bufferization.clone %0 : (memref<2xf32>) -> (memref<2xf32>)
  cf.br ^bb3(%4 : memref<2xf32>)
^bb3(%2: memref<2xf32>):
  ...
  memref.dealloc %2 : memref<2xf32> // free temp buffer %2
  memref.dealloc %0 : memref<2xf32> // %0 can be safely freed here
  return
}
```

Note that a temporary buffer for %2 was introduced to free all allocations
properly. Note further that the unnecessary allocation of %3 can be easily
removed using one of the post-pass transformations or the canonicalization pass.

The presented example also works with dynamically shaped types.

BufferDeallocation performs a fix-point iteration taking all aliases of all
tracked allocations into account. We initialize the general iteration process
using all tracked allocations and their associated aliases. As soon as we
encounter an alias that is not properly dominated by our allocation, we mark
this alias as *critical* (needs to be freed and tracked by the internal
fix-point iteration). The following sample demonstrates the presence of critical
and non-critical aliases:

![nested_branch_example_pre_move](/includes/img/nested_branch_example_pre_move.svg)

```mlir
func.func @condBranchDynamicTypeNested(
  %arg0: i1,
  %arg1: memref<?xf32>,  // aliases: %3, %4
  %arg2: memref<?xf32>,
  %arg3: index) {
  cf.cond_br %arg0, ^bb1, ^bb2(%arg3: index)
^bb1:
  cf.br ^bb6(%arg1 : memref<?xf32>)
^bb2(%0: index):
  %1 = memref.alloc(%0) : memref<?xf32>  // cannot be moved upwards due to the
                                         // data dependency on %0
                                         // aliases: %2, %3, %4
  use(%1)
  cf.cond_br %arg0, ^bb3, ^bb4
^bb3:
  cf.br ^bb5(%1 : memref<?xf32>)
^bb4:
  cf.br ^bb5(%1 : memref<?xf32>)
^bb5(%2: memref<?xf32>):  // non-crit. alias of %1, since %1 dominates %2
  cf.br ^bb6(%2 : memref<?xf32>)
^bb6(%3: memref<?xf32>):  // crit. alias of %arg1 and %2 (in other words %1)
  cf.br ^bb7(%3 : memref<?xf32>)
^bb7(%4: memref<?xf32>):  // non-crit. alias of %3, since %3 dominates %4
  test.copy(%4, %arg2) : (memref<?xf32>, memref<?xf32>) -> ()
  return
}
```

Applying BufferDeallocation yields the following output:

![nested_branch_example_post_move](/includes/img/nested_branch_example_post_move.svg)

```mlir
func.func @condBranchDynamicTypeNested(
  %arg0: i1,
  %arg1: memref<?xf32>,
  %arg2: memref<?xf32>,
  %arg3: index) {
  cf.cond_br %arg0, ^bb1, ^bb2(%arg3 : index)
^bb1:
  // temp buffer required due to alias %3
  %5 = bufferization.clone %arg1 : (memref<?xf32>) -> (memref<?xf32>)
  cf.br ^bb6(%5 : memref<?xf32>)
^bb2(%0: index):
  %1 = memref.alloc(%0) : memref<?xf32>
  use(%1)
  cf.cond_br %arg0, ^bb3, ^bb4
^bb3:
  cf.br ^bb5(%1 : memref<?xf32>)
^bb4:
  cf.br ^bb5(%1 : memref<?xf32>)
^bb5(%2: memref<?xf32>):
  %6 = bufferization.clone %1 : (memref<?xf32>) -> (memref<?xf32>)
  memref.dealloc %1 : memref<?xf32>
  cf.br ^bb6(%6 : memref<?xf32>)
^bb6(%3: memref<?xf32>):
  cf.br ^bb7(%3 : memref<?xf32>)
^bb7(%4: memref<?xf32>):
  test.copy(%4, %arg2) : (memref<?xf32>, memref<?xf32>) -> ()
  memref.dealloc %3 : memref<?xf32>  // free %3, since %4 is a non-crit. alias of %3
  return
}
```

Since %3 is a critical alias, BufferDeallocation introduces an additional
temporary copy in all predecessor blocks. %3 has an additional (non-critical)
alias %4 that extends the live range until the end of bb7. Therefore, we can
free %3 after its last use, while taking all aliases into account. Note that %4
does not need to be freed, since we did not introduce a copy for it.

The actual introduction of buffer copies is done after the fix-point iteration
has been terminated and all critical aliases have been detected. A critical
alias can be either a block argument or another value that is returned by an
operation. Copies for block arguments are handled by analyzing all predecessor
blocks. This is primarily done by querying the `BranchOpInterface` of the
associated branch terminators that can jump to the current block. Consider the
following example which involves a simple branch and the critical block argument
%2:

```mlir
  custom.br ^bb1(..., %0, : ...)
  ...
  custom.br ^bb1(..., %1, : ...)
  ...
^bb1(%2: memref<2xf32>):
  ...
```

The `BranchOpInterface` allows us to determine the actual values that will be
passed to block bb1 and its argument %2 by analyzing its predecessor blocks.
Once we have resolved the values %0 and %1 (that are associated with %2 in this
sample), we can introduce a temporary buffer and clone its contents into the new
buffer. Afterwards, we rewire the branch operands to use the newly allocated
buffer instead. However, blocks can also have predecessors that are implicitly
defined by parent ops that implement the `RegionBranchOpInterface`. This can be
the case if the block argument belongs to the entry block of a region. In this
setting, we have to identify all predecessor regions defined by the parent
operation. For every region, we need to get all terminator operations
implementing the `ReturnLike` trait, indicating that they can branch to our
current block. Finally, we can use similar functionality as described above to
add the temporary copy. This time, we can modify the terminator operands
directly without touching a high-level interface.
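
For the branch snippet above, the rewrite could look roughly as follows (a
sketch using the same imaginary “custom.br” operation, not actual pass output):
a clone is introduced in front of each predecessor terminator and the branch
operand is rewired to the cloned buffer:

```mlir
  // The resolved value %0 is cloned and the branch operand is rewired.
  %3 = bufferization.clone %0 : (memref<2xf32>) -> (memref<2xf32>)
  custom.br ^bb1(..., %3, : ...)
  ...
  // The same happens for the second predecessor and its value %1.
  %4 = bufferization.clone %1 : (memref<2xf32>) -> (memref<2xf32>)
  custom.br ^bb1(..., %4, : ...)
  ...
^bb1(%2: memref<2xf32>):
  ...
```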

Consider the following inner-region control-flow sample that uses an imaginary
“custom.region_if” operation. It either executes the “then” or “else” region and
always continues to the “join” region. The “custom.region_if_yield” operation
returns a result to the parent operation. This sample demonstrates the use of
the `RegionBranchOpInterface` to determine predecessors in order to infer the
high-level control flow:

```mlir
func.func @inner_region_control_flow(
  %arg0 : index,
  %arg1 : index) -> memref<?x?xf32> {
  %0 = memref.alloc(%arg0, %arg0) : memref<?x?xf32>
  %1 = custom.region_if %0 : memref<?x?xf32> -> (memref<?x?xf32>)
   then(%arg2 : memref<?x?xf32>) {  // aliases: %arg4, %1
    custom.region_if_yield %arg2 : memref<?x?xf32>
   } else(%arg3 : memref<?x?xf32>) {  // aliases: %arg4, %1
    custom.region_if_yield %arg3 : memref<?x?xf32>
   } join(%arg4 : memref<?x?xf32>) {  // aliases: %1
    custom.region_if_yield %arg4 : memref<?x?xf32>
   }
  return %1 : memref<?x?xf32>
}
```

![region_branch_example_pre_move](/includes/img/region_branch_example_pre_move.svg)

Non-block arguments (other values) can become aliases when they are returned by
dialect-specific operations. BufferDeallocation supports this behavior via the
`RegionBranchOpInterface`. Consider the following example that uses an “scf.if”
operation to determine the value of %2 at runtime, which creates an alias:

```mlir
func.func @nested_region_control_flow(%arg0 : index, %arg1 : index) -> memref<?x?xf32> {
  %0 = arith.cmpi "eq", %arg0, %arg1 : index
  %1 = memref.alloc(%arg0, %arg0) : memref<?x?xf32>
  %2 = scf.if %0 -> (memref<?x?xf32>) {
    scf.yield %1 : memref<?x?xf32>   // %2 will be an alias of %1
  } else {
    %3 = memref.alloc(%arg0, %arg1) : memref<?x?xf32>  // nested allocation in a
                                                       // div. branch
    use(%3)
    scf.yield %1 : memref<?x?xf32>   // %2 will be an alias of %1
  }
  return %2 : memref<?x?xf32>
}
```

In this example, a dealloc is inserted to release the buffer within the else
block since it cannot be accessed by the remainder of the program. Accessing the
`RegionBranchOpInterface` allows us to infer that %2 is a non-critical alias of
%1 which does not need to be tracked.

```mlir
func.func @nested_region_control_flow(%arg0: index, %arg1: index) -> memref<?x?xf32> {
  %0 = arith.cmpi "eq", %arg0, %arg1 : index
  %1 = memref.alloc(%arg0, %arg0) : memref<?x?xf32>
  %2 = scf.if %0 -> (memref<?x?xf32>) {
    scf.yield %1 : memref<?x?xf32>
  } else {
    %3 = memref.alloc(%arg0, %arg1) : memref<?x?xf32>
    use(%3)
    memref.dealloc %3 : memref<?x?xf32>  // %3 can be safely freed here
    scf.yield %1 : memref<?x?xf32>
  }
  return %2 : memref<?x?xf32>
}
```

Analogous to the previous case, we have to detect all terminator operations in
all attached regions of “scf.if” that provide a value to their parent operation
(in this sample via scf.yield). Querying the `RegionBranchOpInterface` allows us
to determine the regions that “return” a result to their parent operation. Like
before, we have to update all `ReturnLike` terminators as described above.
Reconsider a slightly adapted version of the “custom.region_if” example from
above that uses a nested allocation:

```mlir
func.func @inner_region_control_flow_div(
  %arg0 : index,
  %arg1 : index) -> memref<?x?xf32> {
  %0 = memref.alloc(%arg0, %arg0) : memref<?x?xf32>
  %1 = custom.region_if %0 : memref<?x?xf32> -> (memref<?x?xf32>)
   then(%arg2 : memref<?x?xf32>) {  // aliases: %arg4, %1
    custom.region_if_yield %arg2 : memref<?x?xf32>
   } else(%arg3 : memref<?x?xf32>) {
    %2 = memref.alloc(%arg0, %arg1) : memref<?x?xf32>  // aliases: %arg4, %1
    custom.region_if_yield %2 : memref<?x?xf32>
   } join(%arg4 : memref<?x?xf32>) {  // aliases: %1
    custom.region_if_yield %arg4 : memref<?x?xf32>
   }
  return %1 : memref<?x?xf32>
}
```

Since the allocation %2 happens in a divergent branch and cannot be safely
deallocated in a post-dominator, %arg4 will be considered a critical alias.
Furthermore, %arg4 is returned to its parent operation and has an alias %1. This
causes BufferDeallocation to introduce additional copies:

```mlir
func.func @inner_region_control_flow_div(
  %arg0 : index,
  %arg1 : index) -> memref<?x?xf32> {
  %0 = memref.alloc(%arg0, %arg0) : memref<?x?xf32>
  %1 = custom.region_if %0 : memref<?x?xf32> -> (memref<?x?xf32>)
   then(%arg2 : memref<?x?xf32>) {
    %4 = bufferization.clone %arg2 : (memref<?x?xf32>) -> (memref<?x?xf32>)
    custom.region_if_yield %4 : memref<?x?xf32>
   } else(%arg3 : memref<?x?xf32>) {
    %2 = memref.alloc(%arg0, %arg1) : memref<?x?xf32>
    %5 = bufferization.clone %2 : (memref<?x?xf32>) -> (memref<?x?xf32>)
    memref.dealloc %2 : memref<?x?xf32>
    custom.region_if_yield %5 : memref<?x?xf32>
   } join(%arg4: memref<?x?xf32>) {
    %4 = bufferization.clone %arg4 : (memref<?x?xf32>) -> (memref<?x?xf32>)
    memref.dealloc %arg4 : memref<?x?xf32>
    custom.region_if_yield %4 : memref<?x?xf32>
   }
  memref.dealloc %0 : memref<?x?xf32>  // %0 can be safely freed here
  return %1 : memref<?x?xf32>
}
```

## Placement of Deallocs

After introducing allocs and copies, deallocs have to be placed to free
allocated memory and avoid memory leaks. The deallocation needs to take place
after the last use of the given value. The position can be determined by
calculating the common post-dominator of all uses of a value and its remaining
non-critical aliases. A special case is the presence of back edges, since such
edges can cause memory leaks when a newly allocated buffer flows back to another
part of the program. In these cases, we need to free the associated buffer
instances from the previous iteration by inserting additional deallocs.

Consider the following “scf.for” use case containing a nested structured
control-flow if:

```mlir
func.func @loop_nested_if(
  %lb: index,
  %ub: index,
  %step: index,
  %buf: memref<2xf32>,
  %res: memref<2xf32>) {
  %0 = scf.for %i = %lb to %ub step %step
    iter_args(%iterBuf = %buf) -> memref<2xf32> {
    %1 = arith.cmpi "eq", %i, %ub : index
    %2 = scf.if %1 -> (memref<2xf32>) {
      %3 = memref.alloc() : memref<2xf32>  // makes %2 a critical alias due to a
                                           // divergent allocation
      use(%3)
      scf.yield %3 : memref<2xf32>
    } else {
      scf.yield %iterBuf : memref<2xf32>
    }
    scf.yield %2 : memref<2xf32>
  }
  test.copy(%0, %res) : (memref<2xf32>, memref<2xf32>) -> ()
  return
}
```

In this example, the *then* branch of the nested “scf.if” operation returns a
newly allocated buffer.

Since this allocation happens in the scope of a divergent branch, %2 becomes a
critical alias that needs to be handled. As before, we have to insert additional
copies to eliminate this alias using copies of %3 and %iterBuf. This guarantees
that %2 will be a newly allocated buffer that is returned in each iteration.
However, “returning” %2 to its alias %iterBuf turns %iterBuf into a critical
alias as well. In other words, we have to create a copy of %2 to pass it to
%iterBuf. Since this jump represents a back edge, and %2 will always be a new
buffer, we have to free the buffer from the previous iteration to avoid memory
leaks:

```mlir
func.func @loop_nested_if(
  %lb: index,
  %ub: index,
  %step: index,
  %buf: memref<2xf32>,
  %res: memref<2xf32>) {
  %4 = bufferization.clone %buf : (memref<2xf32>) -> (memref<2xf32>)
  %0 = scf.for %i = %lb to %ub step %step
    iter_args(%iterBuf = %4) -> memref<2xf32> {
    %1 = arith.cmpi "eq", %i, %ub : index
    %2 = scf.if %1 -> (memref<2xf32>) {
      %3 = memref.alloc() : memref<2xf32> // makes %2 a critical alias
      use(%3)
      %5 = bufferization.clone %3 : (memref<2xf32>) -> (memref<2xf32>)
      memref.dealloc %3 : memref<2xf32>
      scf.yield %5 : memref<2xf32>
    } else {
      %6 = bufferization.clone %iterBuf : (memref<2xf32>) -> (memref<2xf32>)
      scf.yield %6 : memref<2xf32>
    }
    %7 = bufferization.clone %2 : (memref<2xf32>) -> (memref<2xf32>)
    memref.dealloc %2 : memref<2xf32>
    memref.dealloc %iterBuf : memref<2xf32> // free backedge iteration variable
    scf.yield %7 : memref<2xf32>
  }
  test.copy(%0, %res) : (memref<2xf32>, memref<2xf32>) -> ()
  memref.dealloc %0 : memref<2xf32> // free temp copy %0
  return
}
```

Example for loop-like control flow. The CFG contains back edges that have to be
handled to avoid memory leaks. The bufferization is able to free the backedge
iteration variable %iterBuf.

## Private Analyses Implementations

The BufferDeallocation transformation relies on one primary control-flow
analysis: BufferPlacementAliasAnalysis. Furthermore, we also use dominance and
liveness to place and move nodes. The liveness analysis determines the live
range of a given value. Within this range, a value is alive and can or will be
used in the course of the program. After this range, the value is dead and can
be discarded - in our case, the buffer can be freed. To place the allocs, we
need to know the position from which a value will be alive. The allocs have to
be placed in front of this position. However, the most important analysis is
the alias analysis that is needed to introduce copies and to place all
deallocations.
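
As a small illustration of how liveness and dominance constrain the placement
(a hand-written sketch, not output of the pass), the alloc below has to dominate
the first use of %0 and the dealloc is placed after its last use:

```mlir
func.func @livenessSketch(%arg0: memref<2xf32>) {
  %0 = memref.alloc() : memref<2xf32>  // placed in front of the first use of %0
  use(%0)                              // %0 is alive from here on
  test.copy(%0, %arg0) : (memref<2xf32>, memref<2xf32>) -> ()  // last use of %0
  memref.dealloc %0 : memref<2xf32>    // freed after the end of the live range
  return
}
```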

## Post Phase

In order to limit the complexity of the BufferDeallocation transformation, some
tiny code-polishing/optimization transformations are not applied on-the-fly
during placement. Currently, a canonicalization pattern is added to the clone
operation to reduce the appearance of unnecessary clones.

Note: further transformations might be added to the post-pass phase in the
future.

## Clone Canonicalization

During placement of clones it may happen that unnecessary clones are inserted.
If these clones appear with their corresponding dealloc operation within the
same block, we can use the canonicalizer to remove these unnecessary operations.
Note that this step needs to take place after the insertion of clones and
deallocs in the buffer deallocation step. The canonicalization covers both the
newly created target value of the clone operation and its source value.

## Canonicalization of the Source Buffer of the Clone Operation

In this case, the source of the clone operation can be used instead of its
target. The unused allocation and deallocation operations that are defined for
this clone operation are also removed. Here is a working example generated by
the BufferDeallocation pass that allocates a buffer with dynamic size. A deeper
analysis of this sample reveals that the clone and dealloc operations are
redundant and can be removed.

```mlir
func.func @dynamic_allocation(%arg0: index, %arg1: index) -> memref<?x?xf32> {
  %1 = memref.alloc(%arg0, %arg1) : memref<?x?xf32>
  %2 = bufferization.clone %1 : (memref<?x?xf32>) -> (memref<?x?xf32>)
  memref.dealloc %1 : memref<?x?xf32>
  return %2 : memref<?x?xf32>
}
```

Will be transformed to:

```mlir
func.func @dynamic_allocation(%arg0: index, %arg1: index) -> memref<?x?xf32> {
  %1 = memref.alloc(%arg0, %arg1) : memref<?x?xf32>
  return %1 : memref<?x?xf32>
}
```

In this case, the additional copy %2 can be replaced with its original source
buffer %1; the associated dealloc of %1 is removed as well.

## Canonicalization of the Target Buffer of the Clone Operation

In this case, the target buffer of the clone operation can be used instead of
its source. The unused deallocation operation that is defined for this clone
operation is also removed.

Consider the following example where a generic test operation writes the result
to %temp and then copies %temp to %result. However, these two operations can be
merged into a single step. Canonicalization removes the clone operation and
%temp, and replaces the uses of %temp with %result:

```mlir
func.func @reuseTarget(%arg0: memref<2xf32>, %result: memref<2xf32>){
  %temp = memref.alloc() : memref<2xf32>
  test.generic {
    args_in = 1 : i64,
    args_out = 1 : i64,
    indexing_maps = [#map0, #map0],
    iterator_types = ["parallel"]} %arg0, %temp {
  ^bb0(%gen2_arg0: f32, %gen2_arg1: f32):
    %tmp2 = math.exp %gen2_arg0 : f32
    test.yield %tmp2 : f32
  }: memref<2xf32>, memref<2xf32>
  %result = bufferization.clone %temp : (memref<2xf32>) -> (memref<2xf32>)
  memref.dealloc %temp : memref<2xf32>
  return
}
```

Will be transformed to:

```mlir
func.func @reuseTarget(%arg0: memref<2xf32>, %result: memref<2xf32>){
  test.generic {
    args_in = 1 : i64,
    args_out = 1 : i64,
    indexing_maps = [#map0, #map0],
    iterator_types = ["parallel"]} %arg0, %result {
  ^bb0(%gen2_arg0: f32, %gen2_arg1: f32):
    %tmp2 = math.exp %gen2_arg0 : f32
    test.yield %tmp2 : f32
  }: memref<2xf32>, memref<2xf32>
  return
}
```

## Known Limitations

BufferDeallocation introduces additional clones using the “bufferization”
dialect operation “bufferization.clone”. Analogously, all deallocations use the
“memref” dialect operation “memref.dealloc”. The actual copy process is realized
using “test.copy”. Furthermore, buffers are essentially immutable after their
creation in a block. Other limitations are known in the case of using
unstructured control flow.
701