1# Bufferization
2
3[TOC]
4
5## Overview
6
7Bufferization in MLIR is the process of converting ops with `tensor` semantics
8to ops with `memref` semantics. MLIR provides an infrastructure that bufferizes
9an entire program in a single pass (*One-Shot Bufferize*). This infrastructure
bufferizes all ops that implement the
[`BufferizableOpInterface`](https://github.com/llvm/llvm-project/blob/17a68065c378da74805e4e1b9a5b78cc9f83e580/mlir/include/mlir/Dialect/Bufferization/IR/BufferizableOpInterface.td).
13
14MLIR has an older bufferization infrastructure built around
15[dialect conversion](DialectConversion.md). Most dialect conversion
16bufferization patterns have been migrated to One-Shot Bufferize, but some
17functionality such as function boundary bufferization still depends on dialect
18conversion and its type converter. New projects should use One-Shot Bufferize,
19as the dialect conversion-based bufferization will eventually be deprecated.
20Moreover, One-Shot Bufferize results in better bufferization with fewer memory
21allocations and buffer copies. This documentation is mostly about One-Shot
22Bufferize, but also describes how to gradually migrate a project from dialect
23conversion-based bufferization to One-Shot Bufferize.
24
25## What is One-Shot Bufferize?
26
27One-Shot Bufferize is a new tensor bufferization pass designed for IR in
28[destination-passing style](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/11/dps-fhpc17.pdf),
29and with aggressive in-place bufferization.
30
31One-Shot Bufferize is:
32
33*   **Monolithic**: A single MLIR pass does the entire work, whereas the
34    previous bufferization in MLIR was split across multiple passes residing in
35    different dialects. In One-Shot Bufferize, `BufferizableOpInterface`
36    implementations are spread across different dialects.
37
38*   A **whole-function at a time analysis**. In-place bufferization decisions
39    are made by analyzing SSA use-def chains on tensors. Op interface
40    implementations not only provide the rewrite logic from tensor ops to memref
41    ops, but also helper methods for One-Shot Bufferize's analysis to query
42    information about an op's bufferization/memory semantics.
43
44*   **Extensible** via an op interface: All ops that implement
45    `BufferizableOpInterface` can be bufferized.
46
47*   **2-Pass**: Bufferization is internally broken down into 2 steps: First,
48    analyze the entire IR and make bufferization decisions. Then, bufferize
49    (rewrite) the IR. The analysis has access to exact SSA use-def information.
50    It incrementally builds alias and equivalence sets and does not rely on a
51    posteriori-alias analysis from preallocated memory.
52
53*   **Greedy**: Operations are analyzed one-by-one and it is decided on the spot
54    whether a tensor OpOperand must be copied or not. Heuristics determine the
55    order of analysis.
56
57*   **Modular**: The current One-Shot Analysis can be replaced with a different
    analysis. The results of the analysis are queried by the bufferization via
    `AnalysisState`, in particular `AnalysisState::isInPlace`. Any derived class
    of `AnalysisState` that implements a small number of virtual functions can
61    serve as a custom analysis. It is even possible to run One-Shot Bufferize
62    without any analysis (`AlwaysCopyAnalysisState`), in which case One-Shot
63    Bufferize behaves exactly like the old dialect conversion-based
64    bufferization (i.e., copy every buffer before writing to it).
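
The split between analysis and rewriting described in the list above can be pictured with a small, self-contained C++ mock-up. It does not use MLIR. `ToyAnalysisState`, `ToyAlwaysCopyState`, `ToyPrecomputedState`, and `countCopies` are hypothetical names, and operands are modeled as plain integer ids. Only the idea is taken from the text: the rewrite step asks an analysis state whether an operand is in-place and copies otherwise.

```cpp
#include <cassert>
#include <set>
#include <utility>
#include <vector>

// Hypothetical stand-in for an analysis result (not the MLIR class). The
// rewrite step only asks one question: may this operand be written in-place?
struct ToyAnalysisState {
  virtual ~ToyAnalysisState() = default;
  virtual bool isInPlace(int operandId) const = 0;
};

// No analysis at all: every written operand is copied first.
struct ToyAlwaysCopyState : ToyAnalysisState {
  bool isInPlace(int) const override { return false; }
};

// An analysis that ran earlier and recorded its in-place decisions.
struct ToyPrecomputedState : ToyAnalysisState {
  explicit ToyPrecomputedState(std::set<int> inPlace)
      : inPlaceOperands(std::move(inPlace)) {}
  bool isInPlace(int operandId) const override {
    return inPlaceOperands.count(operandId) != 0;
  }
  std::set<int> inPlaceOperands;
};

// The "rewrite" step of the mock-up: count how many buffer copies would be
// inserted for the given written operands under a given analysis state.
inline int countCopies(const ToyAnalysisState &state,
                       const std::vector<int> &writtenOperands) {
  int copies = 0;
  for (int operandId : writtenOperands) {
    assert(operandId >= 0 && "operand ids are non-negative in this model");
    if (!state.isInPlace(operandId))
      ++copies;
  }
  return copies;
}
```

With three written operands, the always-copy state yields three copies, while a precomputed state that marks two of them as in-place yields one. The rewrite logic is identical in both cases; only the state object changes.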
65
66To reduce complexity, One-Shot Bufferize should be
67[run after other transformations](https://llvm.discourse.group/t/rfc-linalg-on-tensors-update-and-comprehensive-bufferization-rfc/3373),
68typically as one of the last steps right before lowering memref ops. Many
69transformations are easier in tensor land; e.g., tile/fuse/… on tensors first,
70then bufferize the remaining IR.
71
72From an architecture perspective, One-Shot Bufferize consists of
73[BufferizableOpInterface](https://github.com/llvm/llvm-project/blob/17a68065c378da74805e4e1b9a5b78cc9f83e580/mlir/include/mlir/Dialect/Bufferization/IR/BufferizableOpInterface.td)
74(and its implementations) and an
75[analysis](https://github.com/llvm/llvm-project/blob/ae2764e835a26bad9774803eca0a6530df2a3e2d/mlir/include/mlir/Dialect/Bufferization/Transforms/OneShotAnalysis.h#L164)
76of tensor SSA values that decides if a buffer can be used directly or must be
copied. The `bufferize` method of the op interface inspects analysis results and
78rewrites tensor ops into memref ops.
79
80## Goals of Bufferization
81
The high-level goal of every bufferization technique is to:

1.  Use as little memory as possible.
2.  Copy as little memory as possible.
84
85This implies reusing already allocated buffers when possible, turning
86bufferization into an algorithmically complex problem with similarities to
87register allocation.
88
89Depending on the concrete use case, there may be additional bufferization
90requirements. If the contents of a buffer are expensive to compute, there could
91be a tradeoff between *recomputation* and *compute once and copy*. On the
92contrary, it may not even be possible to allocate new buffers at runtime on some
93architectures.
94
95## Destination-Passing Style
96
97Bufferization is an algorithmically complex problem. Given an op with a tensor
98result, bufferization has to choose a memref buffer in which the result can be
99stored. It is always safe to allocate a brand new buffer, but such a
100bufferization strategy would be unacceptable for high-performance codegen. When
101choosing an already existing buffer, we must be careful not to accidentally
102overwrite data that is still needed later in the program.
103
104To simplify this problem, One-Shot Bufferize was designed for ops that are in
*destination-passing style*. For every tensor result, such ops have a tensor
operand whose buffer could be used for storing the result of the op in the
absence of other conflicts. We call such tensor operands the *destination*.
108
109As an example, consider the following op: `%0 = tensor.insert %cst into
110%t[%idx] : tensor<?xf32>`
111
112`%t` is the destination in this example. When choosing a buffer for the result
113`%0`, One-Shot Bufferize considers only two options:
114
1151.  buffer(`%0`) = buffer(`%t`).
1162.  buffer(`%0`) is a newly allocated buffer.
117
118There may be other buffers in the same function that could potentially be used
119for buffer(`%0`), but those are not considered by One-Shot Bufferize to keep the
120bufferization simple. One-Shot Bufferize could be extended to consider such
121buffers in the future to achieve a better quality of bufferization.
122
123Tensor ops that are not in destination-passing style always bufferize to a
124memory allocation. E.g.:
125
126```mlir
127%0 = tensor.generate %sz {
128^bb0(%i : index):
129  %cst = arith.constant 0.0 : f32
130  tensor.yield %cst : f32
131} : tensor<?xf32>
132```
133
134The result of `tensor.generate` does not have a "destination", so bufferization
135allocates a new buffer. This could be avoided by choosing an op such as
136`linalg.generic`, which can express the same computation with a destination
137("out") tensor:
138
139```mlir
140#map = affine_map<(i) -> (i)>
141%0 = linalg.generic {indexing_maps = [#map], iterator_types = ["parallel"]}
142                    outs(%t : tensor<?xf32>) {
143  ^bb0(%arg0 : f32):
144    %cst = arith.constant 0.0 : f32
145    linalg.yield %cst : f32
146} -> tensor<?xf32>
147```
148
149At first glance, the above `linalg.generic` op may not seem very useful because
150the output tensor `%t` is entirely overwritten. Why pass the tensor `%t` as an
151operand in the first place? As an example, this can be useful for overwriting a
152slice of a tensor:
153
154```mlir
155%t = tensor.extract_slice %s [%idx] [%sz] [1] : tensor<?xf32> to tensor<?xf32>
156%0 = linalg.generic ... outs(%t) { ... } -> tensor<?xf32>
157%1 = tensor.insert_slice %0 into %s [%idx] [%sz] [1]
158    : tensor<?xf32> into tensor<?xf32>
159```
160
161The above example bufferizes to a `memref.subview`, followed by a
162"`linalg.generic` on memrefs" that overwrites the memory of the subview. The
163`tensor.insert_slice` bufferizes to a no-op (in the absence of RaW conflicts
164such as a subsequent read of `%s`).
165
166RaW conflicts are detected with an analysis of SSA use-def chains (details
167later). One-Shot Bufferize works best if there is a single SSA use-def chain,
168where the result of a tensor op is the "destination" operand of the next tensor
169ops, e.g.:
170
171```mlir
172%0 = "my_dialect.some_op"(%t) : (tensor<?xf32>) -> (tensor<?xf32>)
173%1 = "my_dialect.another_op"(%0) : (tensor<?xf32>) -> (tensor<?xf32>)
174%2 = "my_dialect.yet_another_op"(%1) : (tensor<?xf32>) -> (tensor<?xf32>)
175```
176
177Buffer copies are likely inserted if the SSA use-def chain splits at some point,
178e.g.:
179
180```mlir
181%0 = "my_dialect.some_op"(%t) : (tensor<?xf32>) -> (tensor<?xf32>)
182%1 = "my_dialect.another_op"(%0) : (tensor<?xf32>) -> (tensor<?xf32>)
183%2 = "my_dialect.yet_another_op"(%0) : (tensor<?xf32>) -> (tensor<?xf32>)
184```
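
The difference between the two snippets above can be reproduced with a deliberately simplified C++ model. It is a sketch under strong assumptions, not the One-Shot Analysis: values are integer ids, every op has exactly one destination operand, ops are visited in program order, and the only rule is "an in-place write is unsafe if a later op still reads the overwritten value". `ToyOp`, `decideInPlace`, and `countOutOfPlace` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One op in destination-passing style: it reads `operand`, and if it
// bufferizes in-place it overwrites the buffer of `operand` to produce
// `result`. Values are modeled as integer ids (hypothetical encoding).
struct ToyOp {
  int operand;
  int result;
};

// Returns, for each op in program order, whether its operand may be
// overwritten in-place. Rule of this sketch: an in-place write is safe only
// if no later op still reads the value that would be overwritten.
inline std::vector<bool> decideInPlace(const std::vector<ToyOp> &ops) {
  std::vector<bool> inPlace(ops.size(), true);
  for (std::size_t i = 0; i < ops.size(); ++i) {
    assert(ops[i].operand != ops[i].result && "SSA: result is a new value");
    for (std::size_t j = i + 1; j < ops.size(); ++j) {
      if (ops[j].operand == ops[i].operand) {
        inPlace[i] = false; // RaW conflict: a later read needs the old data.
        break;
      }
    }
  }
  return inPlace;
}

// Number of ops that need a new buffer plus a copy under these decisions.
inline int countOutOfPlace(const std::vector<bool> &inPlace) {
  int copies = 0;
  for (bool decision : inPlace)
    if (!decision)
      ++copies;
  return copies;
}
```

Encoding `%t`, `%0`, `%1`, `%2` as 0 to 3, the single chain `{0,1}, {1,2}, {2,3}` needs no copy in this model. The split chain `{0,1}, {1,2}, {1,3}` needs one, because the second op would overwrite `%0` while the third op still reads it.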
185
186One-Shot Bufferize has debug flags (`test-analysis-only print-conflicts`) that
187print the results of the analysis and explain to the user why buffer copies were
188inserted.
189
190## Using One-Shot Bufferize
191
192MLIR provides a pass
193[`-one-shot-bufferize`](https://mlir.llvm.org/docs/Passes/#-one-shot-bufferize-one-shot-bufferize)
194that performs an analysis and bufferizes all ops with tensor semantics that
195implement `BufferizableOpInterface`. For modularity reasons, these op interface
196implementations are typically external models that live in a dialect's
197"Transforms" build unit. (External models are a mechanism for implementing an op
198interface in a different build unit.) It is the user's responsibility to ensure
199that all needed external models are registered before running One-Shot
200Bufferize.
201
202By default, One-Shot Bufferize fails when it encounters an op with tensor
203semantics (i.e., tensor result or tensor operand) that is not bufferizable
204(i.e., does not implement `BufferizableOpInterface`). This can be avoided with
205`allow-unknown-ops`. In that case, One-Shot Bufferize inserts
206`to_memref`/`to_tensor` ops around the bufferization boundary. These ops are
207named versions of `unrealized_conversion_cast`. Note that One-Shot Bufferize's
208analysis can currently not analyze these ops, so input IR with such ops may fail
209bufferization. Therefore, running One-Shot Bufferize multiple times in a
210sequence is also not supported at the moment.
211
212One-Shot Bufferize can be configured to bufferize only ops from a set of
213dialects with `dialect-filter`. This can be useful for gradually migrating from
214dialect conversion-based bufferization to One-Shot Bufferize. One-Shot Bufferize
215must run first in such a case, because dialect conversion-based bufferization
216generates `to_tensor`/`to_memref` ops which One-Shot Bufferize cannot analyze.
217
218One-Shot Bufferize can also be called programmatically with
219[`bufferization::runOneShotBufferize`](https://github.com/llvm/llvm-project/blob/ae2764e835a26bad9774803eca0a6530df2a3e2d/mlir/include/mlir/Dialect/Bufferization/Transforms/OneShotAnalysis.h#L167).
220Alternatively,
221[`bufferization::bufferizeOp`](https://github.com/llvm/llvm-project/blob/ae2764e835a26bad9774803eca0a6530df2a3e2d/mlir/include/mlir/Dialect/Bufferization/Transforms/Bufferize.h#L78)
222skips the analysis and inserts a copy on every buffer write, just like the
223dialect conversion-based bufferization.
224
225## Buffer Deallocation
226
227One-Shot Bufferize deallocates all buffers that it allocates. This is in
228contrast to the dialect conversion-based bufferization that delegates this job
229to the
230[`-buffer-deallocation`](https://mlir.llvm.org/docs/Passes/#-buffer-deallocation-adds-all-required-dealloc-operations-for-all-allocations-in-the-input-program)
231pass. By default, One-Shot Bufferize rejects IR where a newly allocated buffer
232is returned from a block. Such IR will fail bufferization.
233
234A new buffer allocation is returned from a block when the result of an op that
235is not in destination-passing style is returned. E.g.:
236
237```mlir
238%0 = scf.if %c -> (tensor<?xf32>) {
239  %1 = tensor.generate ... -> tensor<?xf32>
240  scf.yield %1 : tensor<?xf32>
241} else {
242  scf.yield %another_tensor : tensor<?xf32>
243}
244```
245
246The `scf.yield` in the "else" branch is OK, but the `scf.yield` in the "then"
247branch will be rejected.
248
249Another case in which a buffer allocation may be returned is when a buffer copy
250must be inserted due to a RaW conflict. E.g.:
251
252```mlir
253%0 = scf.if %c -> (tensor<?xf32>) {
254  %1 = tensor.insert %cst into %another_tensor[%idx] : tensor<?xf32>
255  "my_dialect.reading_tensor_op"(%another_tensor) : (tensor<?xf32>) -> ()
256  ...
257  scf.yield %1 : tensor<?xf32>
258} else {
259  scf.yield %yet_another_tensor : tensor<?xf32>
260}
261```
262
263In the above example, a buffer copy of buffer(`%another_tensor`) (with `%cst`
264inserted) is yielded from the "then" branch.
265
266In both examples, a buffer is allocated inside of a block and then yielded from
267the block. Deallocation of such buffers is tricky and not currently implemented
268in an efficient way. For this reason, One-Shot Bufferize must be explicitly
269configured with `allow-return-allocs` to support such IR.
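
The acceptance rule described above can be stated as a tiny C++ predicate. This is only a mock-up of the rule as worded in this section: blocks and buffers are modeled with plain integers, and `ToyBlock`, `yieldsNewAllocation`, and `isAccepted` are hypothetical names that do not exist in MLIR.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// A block that allocates some buffers and yields one buffer to its parent.
// Buffers are modeled as integer ids (hypothetical encoding).
struct ToyBlock {
  std::vector<int> allocatedHere;
  int yielded;
};

// True if the yielded buffer was allocated inside this very block.
inline bool yieldsNewAllocation(const ToyBlock &block) {
  return std::find(block.allocatedHere.begin(), block.allocatedHere.end(),
                   block.yielded) != block.allocatedHere.end();
}

// Mirrors the rule from the text: yielding a new allocation is rejected
// unless the `allow-return-allocs` behavior is requested.
inline bool isAccepted(const ToyBlock &block, bool allowReturnAllocs) {
  assert(block.yielded >= 0 && "buffer ids are non-negative in this model");
  return allowReturnAllocs || !yieldsNewAllocation(block);
}
```

In terms of the first `scf.if` example, the "then" block corresponds to `ToyBlock{{7}, 7}` (it yields the buffer it allocated) and is rejected by default, while the "else" block corresponds to `ToyBlock{{}, 3}` and is accepted.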
270
271When running with `allow-return-allocs`, One-Shot Bufferize may introduce
272allocations that cannot be deallocated by One-Shot Bufferize yet. For that
273reason, `-buffer-deallocation` must be run after One-Shot Bufferize. This buffer
274deallocation pass resolves yields of newly allocated buffers with copies. E.g.,
275the `scf.if` example above would bufferize to IR similar to the following:
276
277```mlir
278%0 = scf.if %c -> (memref<?xf32>) {
279  %1 = memref.alloc(...) : memref<?xf32>
280  ...
281  scf.yield %1 : memref<?xf32>
282} else {
283  %2 = memref.alloc(...) : memref<?xf32>
284  memref.copy %another_memref, %2
285  scf.yield %2 : memref<?xf32>
286}
287```
288
289In the bufferized IR, both branches return a newly allocated buffer, so it does
290not matter which if-branch was taken. In both cases, the resulting buffer `%0`
291must be deallocated at some point after the `scf.if` (unless the `%0` is
292returned/yielded from its block).
293
294Note: Buffer allocations that are returned from a function are not deallocated,
295not even with `-buffer-deallocation`. It is the caller's responsibility to
296deallocate the buffer. In the future, this could be automated with allocation
297hoisting (across function boundaries) or reference counting.
298
299One-Shot Bufferize can be configured to leak all memory and not generate any
300buffer deallocations with `create-deallocs=0`. This can be useful for
301compatibility with legacy code that has its own method of deallocating buffers.
302
303## Memory Layouts
304
305One-Shot Bufferize bufferizes ops from top to bottom. This works well when all
306ops are bufferizable. However, when encountering a non-bufferizable tensor with
307`allow-unknown-ops`, One-Shot Bufferize must insert `to_memref` ops at the
308bufferization boundary and decide on a memref type. By default, One-Shot
Bufferize chooses the most dynamic memref type wrt. layout maps. E.g.:
310
311```mlir
%0 = "my_dialect.unbufferizable_op"(%t) : (tensor<?x?xf32>) -> (tensor<?x?xf32>)
%1 = tensor.extract %0[%idx1, %idx2] : tensor<?x?xf32>
314```
315
When bufferizing the above IR, One-Shot Bufferize inserts a `to_memref` op with
317dynamic offset and strides:
318
319```mlir
#map = affine_map<(d0, d1)[s0, s1, s2] -> (d0 * s1 + s0 + d1 * s2)>
%0 = "my_dialect.unbufferizable_op"(%t) : (tensor<?x?xf32>) -> (tensor<?x?xf32>)
%0_m = bufferization.to_memref %0 : memref<?x?xf32, #map>
%1 = memref.load %0_m[%idx1, %idx2] : memref<?x?xf32, #map>
324```
325
326All users of `%0` have fully dynamic layout maps. This ensures that the
327bufferized IR composes well with future bufferizations of `unbufferizable_op`
328(maybe bufferized by another pass), regardless of the exact memref type of the
329future bufferization. If the op turns out to be bufferized to an op with a
330simpler memref type (e.g., identity layout map), we expect that canonicalization
331patterns would clean up unnecessarily dynamic layout maps. (Some of these
332canonicalization patterns may not be implemented yet.)
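
To make the fully dynamic layout map above concrete, the following standalone C++ sketch evaluates `(d0, d1)[s0, s1, s2] -> (d0 * s1 + s0 + d1 * s2)` for given symbol values. It is illustrative arithmetic only. `ToyStridedLayout2D`, `linearIndex`, and `identityLayout` are hypothetical names, and the identity layout is assumed to be the usual row-major one (offset 0, strides `[numCols, 1]`).

```cpp
#include <cassert>
#include <cstdint>

// Symbols of the layout map (d0, d1)[s0, s1, s2] -> (d0 * s1 + s0 + d1 * s2).
struct ToyStridedLayout2D {
  std::int64_t offset;  // s0
  std::int64_t stride0; // s1
  std::int64_t stride1; // s2
};

// Element position in the underlying buffer for the index (d0, d1).
inline std::int64_t linearIndex(const ToyStridedLayout2D &layout,
                                std::int64_t d0, std::int64_t d1) {
  assert(d0 >= 0 && d1 >= 0 && "indices are non-negative");
  return d0 * layout.stride0 + layout.offset + d1 * layout.stride1;
}

// Assumption of this sketch: the identity layout of a 2-D buffer is
// row-major, i.e. offset 0 and strides [numCols, 1].
inline ToyStridedLayout2D identityLayout(std::int64_t numCols) {
  return ToyStridedLayout2D{0, numCols, 1};
}
```

An identity layout is thus one particular choice of symbol values, which is why code written against the fully dynamic map keeps working if the producer later turns out to have a simpler layout. For instance, index `(1, 2)` maps to position 6 in a row-major buffer with 4 columns, and to position 17 for offset 5 and strides `[8, 2]`.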
333
334One-Shot Bufferize tries to infer the most precise memref type when bufferizing
an op. If the entire IR is bufferizable, we do not have to resort to
conservatively using fully dynamic layout maps. In that case, we also do not have
337to rely on canonicalization patterns to clean up the bufferized IR.
338
Note: There are some bufferizable ops for which a precise layout map cannot be
340inferred. E.g., a `tensor.cast` from a `tensor<*xf32>` to a `tensor<?x?xf32>`
341must be bufferized to a `memref.cast` with a memref type that has a fully
342dynamic layout map.
343
344One-Shot Bufferize has an option `unknown-type-conversion` to control the
345generation of layout maps when no precise layout can be inferred:
346
347*   `fully-dynamic-layout-map` uses fully dynamic layout maps and is the default
348    behavior. This composes well when IR is partially bufferized.
349*   `identity-layout-map` uses static identity layout maps. This option can be
350    useful for legacy code that cannot handle memref types with layout maps.
351    Note that this setting can lead to additional buffer copies when folding a
352    `to_tensor`/`to_memref` pair with memref types that are not cast-compatible.
353
354Note: The `unknown-type-conversion` option does not affect layout maps of
355function signatures. There is a separate `function-signature-type-conversion`
356option that controls layout maps of function parameters and function results.
357
358## Extending One-Shot Bufferize
359
360Custom ops can be bufferized if they implement `BufferizableOpInterface`. Users
361must at least implement the following interface methods.
362
363*   `bufferizesToMemoryRead`: Return `true` if the buffer of the given tensor
364    OpOperand is read.
365*   `bufferizesToMemoryWrite`: Return `true` if the buffer of the given tensor
366    OpOperand is written (if bufferizing in-place).
367*   `getAliasingOpResult`: Return the OpResults that may share the same buffer
    as the given OpOperand. This interface method describes the
    OpOperand-to-OpResult mapping wrt. destination-passing style.
370*   `bufferRelation`: Return `BufferRelation::Equivalent` if the given OpResult
371    is the exact same memref as the aliasing OpOperand after bufferization (in
372    case of in-place bufferization). Otherwise, (e.g., they overlap but are not
373    necessarily the exact same memrefs), `BufferRelation::None` should be
374    returned. Additional buffer relations will be added in the future, but
375    `BufferRelation::None` is always safe.
376*   `bufferize`: Rewrite the op with the given rewriter. Ops should be replaced
377    with `bufferization::replaceOpWithBufferizedValues`.
378
379To get a better intuition of the interface methods, we invite users to take a
380look at existing implementations in MLIR, e.g., the implementation of
381`tensor.insert` or `tensor.extract`.
382
383## Debugging Buffer Copies
384
385To get a better understanding of why One-Shot Bufferize introduced a buffer
386copy, users can run the pass with `test-analysis-only print-conflicts`. Every
387tensor op is then annotated with an attribute that has a boolean value for each
388tensor OpOperand. `true` means that the OpOperand bufferizes in-place. `false`
389means that the OpOperand bufferizes out-of-place and a buffer copy will be
390inserted.
391
392There are two reasons why a buffer copy may be inserted.
393
3941.  Due to a RaW conflict, it is not safe to bufferize in-place. I.e., the
395    overwritten data is still needed.
3962.  The buffer is not writable. E.g., `memref.global` buffers that are the
397    result of `arith.constant` ops are never modified.
398
399In the first case, `print-conflicts` illustrates the conflict in the form of a
400("read", "conflicting write", "last write") tuple.
401
402## Understanding the SSA Use-Def Chain Analysis
403
404To get a better understanding of the SSA Use-Def Chain Analysis and the RaW
405conflict detection algorithm, we invite interested users to read the
406[design document](https://discourse.llvm.org/uploads/short-url/5kckJ3DftYwQokG252teFgw3sYa.pdf)
407and watch the corresponding [ODM talk](https://youtu.be/TXEo59CYS9A)
408([slides](https://mlir.llvm.org/OpenMeetings/2022-01-13-One-Shot-Bufferization.pdf)).
410
411## Migrating from Dialect Conversion-based Bufferization
412
413Both dialect conversion-based bufferization and One-Shot Bufferize generate
414`to_tensor`/`to_memref` ops at the bufferization boundary (when run with
415`allow-unknown-ops`). They can be combined and run in sequence. However,
416One-Shot Bufferize must run first because it cannot analyze those boundary ops.
417To update existing code step-by-step, it may be useful to specify a dialect
418filter for One-Shot Bufferize, so that dialects can be switched over one-by-one.
419
420## Bufferization Function Graphs
421
One-Shot Bufferize does not currently support function graph bufferization.
423I.e., `CallOp`, `ReturnOp` and function bbArgs are not bufferizable. Users can
424run the existing `--func-bufferize` bufferization pass after One-Shot Bufferize.
425
426Alternatively, users can try
427[`ModuleBufferization`](https://github.com/llvm/llvm-project/blob/ae2764e835a26bad9774803eca0a6530df2a3e2d/mlir/include/mlir/Dialect/Linalg/ComprehensiveBufferize/ModuleBufferization.h#L31),
428which is an extension of One-Shot Bufferize. This bufferization is still under
429development and does not support arbitrary IR. In essence, returning a tensor
430from a function is not supported, unless it is equivalent to a function bbArg.
431In that case, the corresponding return value can simply be dropped during
432bufferization.
433
434## Dialect Conversion-based Bufferization
435
436Disclaimer: Most dialect conversion-based bufferization has been migrated to
437One-Shot Bufferize. New users should use One-Shot Bufferize (with or without
438analysis). The following documentation is only for existing users of dialect
439conversion-based bufferization.
440
441This system is a simple application of MLIR's dialect conversion infrastructure.
442The bulk of the code related to bufferization is a set of ordinary
443`ConversionPattern`'s that dialect authors write for converting ops that operate
444on `tensor`'s to ops that operate on `memref`'s. A set of conventions and best
445practices are followed that allow these patterns to be run across multiple
446independent passes (rather than requiring a single huge atomic conversion pass),
447which makes the compilation pipelines scalable, robust, and easy to debug.
448
449This document is targeted at people looking to utilize MLIR's bufferization
450functionality, along with people who want to extend it to cover their own ops.
451
452<a name="the-talk">**NOTE:**</a> Before reading this document, please watch the
453talk "Type Conversions the Not-So-Hard-Way: MLIR's New Bufferization
454Infrastructure"
455([slides](https://drive.google.com/file/d/1FVbzCXxZzS9LBLuvpPNLWJD-XDkt54ky/view?usp=sharing),
456[recording](https://drive.google.com/file/d/1VfVajitgf8ZPnd-HRkJvaJiFLhBsluXN/view?usp=sharing)).
457That talk gives a high-level overview of the bufferization infrastructure and
458important conceptual details related to using the MLIR dialect conversion
459infrastructure.
460
461### Bufferization's place in a compilation pipeline
462
463Bufferization itself does not free any of the buffers that have been allocated,
464nor does it do anything particularly intelligent with the placement of buffers
465w.r.t. control flow. Thus, a realistic compilation pipeline will usually consist
466of:
467
4681.  Bufferization
4691.  Buffer optimizations such as `buffer-hoisting`, `buffer-loop-hoisting`, and
470    `promote-buffers-to-stack`, which do optimizations that are only exposed
471    after bufferization.
4721.  Finally, running the [buffer deallocation](BufferDeallocationInternals.md)
473    pass.
474
475After buffer deallocation has been completed, the program will be quite
476difficult to transform due to the presence of the deallocation ops. Thus, other
477optimizations such as linalg fusion on memrefs should be done before that stage.
478
479### General structure of the bufferization process
480
481Bufferization consists of running multiple *partial* bufferization passes,
482followed by one *finalizing* bufferization pass.
483
484There is typically one partial bufferization pass per dialect (though other
485subdivisions are possible). For example, for a dialect `X` there will typically
486be a pass `X-bufferize` that knows how to bufferize all the ops in that dialect.
487By running pass `X-bufferize` for each dialect `X` in the program, all the ops
488in the program are incrementally bufferized.
489
490Partial bufferization passes create programs where only some ops have been
491bufferized. These passes will create *materializations* (also sometimes called
492"casts") that convert between the `tensor` and `memref` type, which allows
493bridging between ops that have been bufferized and ops that have not yet been
494bufferized.
495
496Finalizing bufferizations complete the bufferization process, and guarantee that
497there are no tensors remaining in the program. This involves eliminating the
498materializations. The pass `finalizing-bufferize` provides a minimal pass that
499only eliminates materializations and issues an error if any unbufferized ops
500exist in the program.
501
502However, it is possible for a finalizing bufferization to do more than just
503eliminate materializations. By adding patterns (just as a partial bufferization
504would), it is possible for a finalizing bufferization pass to simultaneously
505bufferize ops and eliminate materializations. This has a number of disadvantages
506discussed in the talk and should generally be avoided.
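
The interplay of partial and finalizing passes described in this section can be mimicked with a few lines of standalone C++. This is a toy model, not MLIR code: ops are reduced to a dialect name and a flag, materializations are left out entirely, and `ToyOp`, `partialBufferize`, and `finalizingBufferize` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// An op is reduced to its dialect and whether it still operates on tensors.
struct ToyOp {
  std::string dialect;
  bool onTensors;
};

// Partial bufferization: bufferize only the ops of one dialect and leave
// everything else untouched.
inline void partialBufferize(std::vector<ToyOp> &ops,
                             const std::string &dialect) {
  assert(!dialect.empty() && "a partial pass targets a named dialect");
  for (ToyOp &op : ops)
    if (op.dialect == dialect)
      op.onTensors = false;
}

// Finalizing bufferization (minimal flavor): succeeds only if no op that
// operates on tensors is left in the program.
inline bool finalizingBufferize(const std::vector<ToyOp> &ops) {
  return std::none_of(ops.begin(), ops.end(),
                      [](const ToyOp &op) { return op.onTensors; });
}
```

Running the partial step for `scf` and `tensor` on a program that also contains a `tcp` op leaves the finalizing check failing. It succeeds only once every dialect in the program has had its partial pass.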
507
508### Example
509
510As a concrete example, we will look at the bufferization pipeline from the
511`mlir-npcomp` reference backend
512([code](https://github.com/llvm/mlir-npcomp/blob/97d6d04d41216e73d40b89ffd79620973fc14ce3/lib/RefBackend/RefBackend.cpp#L232)).
513The code, slightly simplified and annotated, is reproduced here:
514
515```c++
516  // Partial bufferization passes.
517  pm.addPass(createTensorConstantBufferizePass());
518  pm.addNestedPass<func::FuncOp>(createTCPBufferizePass()); // Bufferizes the downstream `tcp` dialect.
519  pm.addNestedPass<func::FuncOp>(createSCFBufferizePass());
520  pm.addNestedPass<func::FuncOp>(createLinalgBufferizePass());
521  pm.addNestedPass<func::FuncOp>(createTensorBufferizePass());
522  pm.addPass(createFuncBufferizePass());
523
524  // Finalizing bufferization pass.
525  pm.addNestedPass<func::FuncOp>(createFinalizingBufferizePass());
526```
527
Looking first at the partial bufferization passes, we see that there is a
sequence of `FuncOp` passes (which run in parallel on functions). These function
530passes are bracketed by `arith-bufferize` and `func-bufferize`, which are module
531passes (and thus serialize the parallel compilation process). These two passes
532must be module passes because they make changes to the top-level module.
533
534The bulk of the bufferization work is done by the function passes. Most of these
535passes are provided as part of the upstream MLIR distribution and bufferize
536their respective dialects (e.g. `scf-bufferize` bufferizes the `scf` dialect).
537The `tcp-bufferize` pass is an exception -- it is a partial bufferization pass
538used to bufferize the downstream `tcp` dialect, and fits in perfectly with all
539the other passes provided upstream.
540
541The last pass is the finalizing bufferization pass. The `mlir-npcomp` reference
542backend has arranged that all ops are bufferized by partial bufferizations, so
543that the upstream `finalizing-bufferize` pass can be used as the finalizing
544bufferization pass. This gives excellent diagnostics when something goes wrong
545with the bufferization process, such as due to an op that wasn't handled by any
546pattern.
547
548### How to write a partial bufferization pass
549
550The contract of a partial bufferization pass is that a subset of ops (or kinds
551of ops, customizable by a ConversionTarget) get bufferized.
552
553A partial bufferization pass is just a pass that uses the
554[dialect conversion](DialectConversion.md) framework to apply
555`ConversionPattern`s with a `tensor` to `memref` type conversion.
556
557To describe how to write such a pass, we will walk through an example, the
558`tensor-bufferize` pass
559([code](https://github.com/llvm/llvm-project/blob/bc8acf2ce8ad6e8c9b1d97b2e02d3f4ad26e1d9d/mlir/lib/Dialect/Tensor/Transforms/Bufferize.cpp#L23),
560[test](https://github.com/llvm/llvm-project/blob/bc8acf2ce8ad6e8c9b1d97b2e02d3f4ad26e1d9d/mlir/test/Dialect/Tensor/bufferize.mlir#L1))
561that bufferizes the `tensor` dialect. Note that these passes have been replaced
562with a `BufferizableOpInterface`-based implementation in the meantime, so we
have to take a look at an older version of the code.
564
565The bulk of the code in the pass will be a set of conversion patterns, with a
566simple example being
[BufferizeCastOp](https://github.com/llvm/llvm-project/blob/2bf6e443e54604c7818c4d1a1837f3d091023270/mlir/lib/Dialect/Tensor/Transforms/Bufferize.cpp#L23).
568
```c++
570class BufferizeCastOp : public OpConversionPattern<tensor::CastOp> {
571public:
572  using OpConversionPattern::OpConversionPattern;
573  LogicalResult
574  matchAndRewrite(tensor::CastOp op, OpAdaptor adaptor,
575                  ConversionPatternRewriter &rewriter) const override {
576    auto resultType = getTypeConverter()->convertType(op.getType());
577    rewriter.replaceOpWithNewOp<MemRefCastOp>(op, resultType, adaptor.source());
578    return success();
579  }
580};
581```
582
583See [the talk](#the-talk) for more details on how to write these patterns.
584
585The
586[pass itself](https://github.com/llvm/llvm-project/blob/bc8acf2ce8ad6e8c9b1d97b2e02d3f4ad26e1d9d/mlir/lib/Dialect/Tensor/Transforms/Bufferize.cpp#L57)
587is very small, and follows the basic pattern of any dialect conversion pass.
588
```c++
590void mlir::populateTensorBufferizePatterns(
591    BufferizeTypeConverter &typeConverter, RewritePatternSet &patterns) {
592  patterns.add<BufferizeCastOp, BufferizeExtractOp>(typeConverter,
593                                                    patterns.getContext());
594}
595
596struct TensorBufferizePass : public TensorBufferizeBase<TensorBufferizePass> {
597  void runOnOperation() override {
598    auto *context = &getContext();
599    BufferizeTypeConverter typeConverter;
600    RewritePatternSet patterns(context);
601    ConversionTarget target(*context);
602
603    populateTensorBufferizePatterns(typeConverter, patterns);
604    target.addIllegalOp<tensor::CastOp, tensor::ExtractOp>();
605    target.addLegalDialect<func::FuncDialect>();
606
607    if (failed(
608            applyPartialConversion(getOperation(), target, std::move(patterns))))
609      signalPassFailure();
610  }
611};
612```
613
The pass has all the hallmarks of a dialect conversion pass that does type
conversions: a `TypeConverter`, a `RewritePatternSet`, a `ConversionTarget`,
and a call to `applyPartialConversion`. Note that a function
617`populateTensorBufferizePatterns` is separated, so that power users can use the
618patterns independently, if necessary (such as to combine multiple sets of
619conversion patterns into a single conversion call, for performance).
620
621One convenient utility provided by the MLIR bufferization infrastructure is the
622`BufferizeTypeConverter`, which comes pre-loaded with the necessary conversions
623and materializations between `tensor` and `memref`.
624
625In this case, the `BufferizationOpsDialect` is marked as legal, so the
626`bufferization.to_tensor` and `bufferization.to_memref` ops, which are inserted
627automatically by the dialect conversion framework as materializations, are
628legal. There is a helper `populateBufferizeMaterializationLegality`
629([code](https://github.com/llvm/llvm-project/blob/a0b65a7bcd6065688189b3d678c42ed6af9603db/mlir/include/mlir/Transforms/Bufferize.h#L53))
630which helps with this in general.
631
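To make the type mapping concrete, the following is a toy, self-contained C++
sketch of the tensor-to-memref conversion that a type converter such as
`BufferizeTypeConverter` performs. It is an illustration under simplifying
assumptions, not MLIR code: it works on type strings instead of `mlir::Type`,
ignores tensor encodings and memref layouts, and `convertTypeString` is a
hypothetical name that does not exist in MLIR.

```cpp
#include <cassert>
#include <string>

// Toy stand-in for a type converter (hypothetical, not an MLIR API): maps a
// textual tensor type to the memref type with the same shape and element type.
std::string convertTypeString(const std::string &type) {
  const std::string tensorPrefix = "tensor<";
  // Non-tensor types (index, f32, memref<...>, ...) are treated as already
  // legal and pass through unchanged.
  if (type.compare(0, tensorPrefix.size(), tensorPrefix) != 0)
    return type;
  assert(type.back() == '>' && "malformed tensor type string");
  // Keep everything after the prefix (shape, element type, closing '>').
  return "memref<" + type.substr(tensorPrefix.size());
}
```

For example, `convertTypeString("tensor<4x?xf32>")` yields `memref<4x?xf32>`,
while `convertTypeString("index")` returns `index` unchanged.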
### Other partial bufferization examples

-   `scf-bufferize`
    ([code](https://github.com/llvm/llvm-project/blob/bc8acf2ce8ad6e8c9b1d97b2e02d3f4ad26e1d9d/mlir/lib/Dialect/SCF/Transforms/Bufferize.cpp#L1),
    [test](https://github.com/llvm/llvm-project/blob/bc8acf2ce8ad6e8c9b1d97b2e02d3f4ad26e1d9d/mlir/test/Dialect/SCF/bufferize.mlir#L1))

    -   Bufferizes ops from the `scf` dialect.
    -   This is an example of how to bufferize ops that implement
        `RegionBranchOpInterface` (that is, they use regions to represent
        control flow).
    -   The bulk of the work is done by
        `lib/Dialect/SCF/Transforms/StructuralTypeConversions.cpp`
        ([code](https://github.com/llvm/llvm-project/blob/daaaed6bb89044ac58a23f1bb1ccdd12342a5a58/mlir/lib/Dialect/SCF/Transforms/StructuralTypeConversions.cpp#L1)),
        which is well-commented and covers how to correctly convert ops that
        contain regions.

-   `func-bufferize`
    ([code](https://github.com/llvm/llvm-project/blob/2f5715dc78328215d51d5664c72c632a6dac1046/mlir/lib/Dialect/Func/Transforms/FuncBufferize.cpp#L1),
    [test](https://github.com/llvm/llvm-project/blob/2f5715dc78328215d51d5664c72c632a6dac1046/mlir/test/Dialect/Func/func-bufferize.mlir#L1))

    -   Bufferizes `func`, `call`, and `BranchOpInterface` ops.
    -   This is an example of how to bufferize ops that have multi-block
        regions.
    -   This is an example of a pass that is not split along dialect
        subdivisions.

### How to write a finalizing bufferization pass

The contract of a finalizing bufferization pass is that all tensors are gone
from the program.

The easiest way to write a finalizing bufferization pass is to not write one at
all! MLIR provides a pass `finalizing-bufferize` which eliminates the
`bufferization.to_tensor` / `bufferization.to_memref` materialization ops
inserted by partial bufferization passes and emits an error if that is not
sufficient to remove all tensors from the program.

This pass is sufficient when partial bufferization passes have bufferized all
the ops in the program, leaving behind only the materializations. When possible,
it is recommended to structure your pass pipeline this way, as this has the
significant advantage that if an op does not get bufferized (due to a missing
pattern, bug in the code, etc.), `finalizing-bufferize` will emit a clean
error, and the IR seen by `finalizing-bufferize` will contain only one
unbufferized op.

However, before the current bufferization infrastructure was put in place,
bufferization could only be done as a single finalizing bufferization mega-pass
that used the `populate*BufferizePatterns` functions from multiple dialects to
bufferize everything at once. Thus, one might see code in downstream projects
structured this way. This structure is not recommended in new code. A helper,
`populateEliminateBufferizeMaterializationsPatterns`
([code](https://github.com/llvm/llvm-project/blob/a0b65a7bcd6065688189b3d678c42ed6af9603db/mlir/include/mlir/Transforms/Bufferize.h#L58))
is available for such passes to provide patterns that eliminate
`bufferization.to_tensor` and `bufferization.to_memref`.

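The folding performed by such elimination patterns can be illustrated with a
toy model. The sketch below is plain C++ and uses no MLIR API: `Op` and
`finalizeToy` are hypothetical names, ops are identified by strings, and the
input is assumed to be listed in SSA (definition-before-use) order. It folds
`to_memref(to_tensor(%m))` to `%m`, drops materializations that become dead,
and reports failure if any materialization remains. This mirrors the contract
described above only in spirit; the real pass works on typed IR.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy op: a name, operand value names, and one result name ("" if none).
struct Op {
  std::string name;
  std::vector<std::string> operands;
  std::string result;
};

// Returns true and rewrites `ops` if all materializations can be removed.
// Returns false and leaves `ops` untouched otherwise.
bool finalizeToy(std::vector<Op> &ops) {
  std::map<std::string, std::string> wrappedMemref; // tensor -> memref
  std::map<std::string, std::string> replacement;   // folded result -> memref
  std::vector<Op> kept;

  for (Op op : ops) {
    // Forward already-folded values into this op's operands.
    for (std::string &operand : op.operands) {
      auto it = replacement.find(operand);
      if (it != replacement.end())
        operand = it->second;
    }
    if (op.name == "to_tensor" || op.name == "to_memref")
      assert(op.operands.size() == 1 && "materializations take one operand");
    if (op.name == "to_tensor") {
      wrappedMemref[op.result] = op.operands[0];
    } else if (op.name == "to_memref") {
      auto it = wrappedMemref.find(op.operands[0]);
      if (it != wrappedMemref.end()) {
        // to_memref(to_tensor(%m)) folds to %m; drop the op.
        replacement[op.result] = it->second;
        continue;
      }
    }
    kept.push_back(op);
  }

  // Collect every value that is still used by a remaining op.
  std::set<std::string> used;
  for (const Op &op : kept)
    used.insert(op.operands.begin(), op.operands.end());

  std::vector<Op> result;
  for (const Op &op : kept) {
    if (op.name == "to_tensor" && !used.count(op.result))
      continue; // Dead materialization: safe to erase.
    if (op.name == "to_tensor" || op.name == "to_memref")
      return false; // A tensor value is still live.
    result.push_back(op);
  }
  ops = result;
  return true;
}
```

Given `%t = to_tensor %m`, `%m2 = to_memref %t`, and a `memref.copy` of `%m2`,
the function leaves only the copy, now reading from `%m`. If `%t` is instead
consumed by an unbufferized op such as `tensor.extract`, the `to_tensor` stays
live and the function returns `false`.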
### Changes since [the talk](#the-talk)

-   `func-bufferize` was changed to be a partial conversion pass, and there is a
    new `finalizing-bufferize` which serves as a general finalizing
    bufferization pass.
-   Most partial bufferization passes have been reimplemented in terms of
    `BufferizableOpInterface`. New users should use One-Shot Bufferize instead
    of dialect conversion-based bufferization.
