# Bufferization

[TOC]

## Overview

Bufferization in MLIR is the process of converting ops with `tensor` semantics
to ops with `memref` semantics. MLIR provides an infrastructure that bufferizes
an entire program in a single pass (*One-Shot Bufferize*). This infrastructure
bufferizes all ops that implement the
[`BufferizableOpInterface`](https://github.com/llvm/llvm-project/blob/17a68065c378da74805e4e1b9a5b78cc9f83e580/mlir/include/mlir/Dialect/Bufferization/IR/BufferizableOpInterface.td).

MLIR has an older bufferization infrastructure built around
[dialect conversion](DialectConversion.md). Most dialect conversion
bufferization patterns have been migrated to One-Shot Bufferize, but some
functionality such as function boundary bufferization still depends on dialect
conversion and its type converter. New projects should use One-Shot Bufferize,
as the dialect conversion-based bufferization will eventually be deprecated.
Moreover, One-Shot Bufferize results in better bufferization with fewer memory
allocations and buffer copies. This documentation is mostly about One-Shot
Bufferize, but also describes how to gradually migrate a project from dialect
conversion-based bufferization to One-Shot Bufferize.

## What is One-Shot Bufferize?

One-Shot Bufferize is a new tensor bufferization pass designed for IR in
[destination-passing style](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/11/dps-fhpc17.pdf),
and with aggressive in-place bufferization.

One-Shot Bufferize is:

*   **Monolithic**: A single MLIR pass does the entire work, whereas the
    previous bufferization in MLIR was split across multiple passes residing in
    different dialects. In One-Shot Bufferize, `BufferizableOpInterface`
    implementations are spread across different dialects.

*   A **whole-function at a time analysis**. In-place bufferization decisions
    are made by analyzing SSA use-def chains on tensors.
    Op interface
    implementations not only provide the rewrite logic from tensor ops to memref
    ops, but also helper methods for One-Shot Bufferize's analysis to query
    information about an op's bufferization/memory semantics.

*   **Extensible** via an op interface: All ops that implement
    `BufferizableOpInterface` can be bufferized.

*   **2-Pass**: Bufferization is internally broken down into 2 steps: First,
    analyze the entire IR and make bufferization decisions. Then, bufferize
    (rewrite) the IR. The analysis has access to exact SSA use-def information.
    It incrementally builds alias and equivalence sets and does not rely on a
    posteriori alias analysis from preallocated memory.

*   **Greedy**: Operations are analyzed one-by-one and it is decided on the spot
    whether a tensor OpOperand must be copied or not. Heuristics determine the
    order of analysis.

*   **Modular**: The current One-Shot Analysis can be replaced with a different
    analysis. The results of the analysis are queried by the bufferization via
    `AnalysisState`, in particular `AnalysisState::isInPlace`. Any derived class
    of `AnalysisState` that implements a small number of virtual functions can
    serve as a custom analysis. It is even possible to run One-Shot Bufferize
    without any analysis (`AlwaysCopyAnalysisState`), in which case One-Shot
    Bufferize behaves exactly like the old dialect conversion-based
    bufferization (i.e., copy every buffer before writing to it).

To reduce complexity, One-Shot Bufferize should be
[run after other transformations](https://llvm.discourse.group/t/rfc-linalg-on-tensors-update-and-comprehensive-bufferization-rfc/3373),
typically as one of the last steps right before lowering memref ops. Many
transformations are easier in tensor land; e.g., tile/fuse/… on tensors first,
then bufferize the remaining IR.
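
As a rough illustration of the two-step, greedy structure described in the list
above, the following self-contained C++ sketch models a function as a
straight-line list of toy ops. It is not the One-Shot Analysis itself: `ToyOp`,
`analyzeInPlace`, and `countCopies` are hypothetical names, and the conflict
rule used here (a later op still uses the destination value) is a deliberate
simplification of the real analysis, which also tracks alias and equivalence
sets.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy op (hypothetical): when bufferized in place, the op writes its result
// into the buffer of `dest`. `reads` lists tensor operands that are only read.
struct ToyOp {
  std::string result;              // SSA value defined by the op
  std::string dest;                // destination operand (may be overwritten)
  std::vector<std::string> reads;  // read-only tensor operands
};

// Step 1 (analysis): visit ops one by one and decide on the spot whether the
// destination buffer can be reused. In this toy model an op must copy if any
// later op still uses the old value of `dest`, either as its own destination
// or as a read-only operand.
std::vector<bool> analyzeInPlace(const std::vector<ToyOp> &ops) {
  std::vector<bool> inPlace(ops.size(), true);
  for (std::size_t i = 0; i < ops.size(); ++i) {
    for (std::size_t j = i + 1; j < ops.size() && inPlace[i]; ++j) {
      if (ops[j].dest == ops[i].dest)
        inPlace[i] = false;
      for (const std::string &r : ops[j].reads)
        if (r == ops[i].dest)
          inPlace[i] = false;
    }
  }
  return inPlace;
}

// Step 2 (rewrite): the rewrite only consults the decisions made in step 1.
// Here it merely counts how many buffer copies those decisions imply.
std::size_t countCopies(const std::vector<bool> &inPlace) {
  std::size_t copies = 0;
  for (bool b : inPlace)
    if (!b)
      ++copies;
  return copies;
}
```

In this model, a single chain in which each result is the destination of the
next op needs no copies, whereas two ops that both use the same value as their
destination force the first of them out-of-place.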

From an architecture perspective, One-Shot Bufferize consists of
[BufferizableOpInterface](https://github.com/llvm/llvm-project/blob/17a68065c378da74805e4e1b9a5b78cc9f83e580/mlir/include/mlir/Dialect/Bufferization/IR/BufferizableOpInterface.td)
(and its implementations) and an
[analysis](https://github.com/llvm/llvm-project/blob/ae2764e835a26bad9774803eca0a6530df2a3e2d/mlir/include/mlir/Dialect/Bufferization/Transforms/OneShotAnalysis.h#L164)
of tensor SSA values that decides if a buffer can be used directly or must be
copied. The `bufferize` method of the op interface inspects analysis results and
rewrites tensor ops into memref ops.

## Goals of Bufferization

The high-level goal of every bufferization technique is to:

1. Use as little memory as possible.
2. Copy as little memory as possible.

This implies reusing already allocated buffers when possible, turning
bufferization into an algorithmically complex problem with similarities to
register allocation.

Depending on the concrete use case, there may be additional bufferization
requirements. If the contents of a buffer are expensive to compute, there could
be a tradeoff between *recomputation* and *compute once and copy*. On the other
hand, it may not even be possible to allocate new buffers at runtime on some
architectures.

## Destination-Passing Style

Bufferization is an algorithmically complex problem. Given an op with a tensor
result, bufferization has to choose a memref buffer in which the result can be
stored. It is always safe to allocate a brand new buffer, but such a
bufferization strategy would be unacceptable for high-performance codegen. When
choosing an already existing buffer, we must be careful not to accidentally
overwrite data that is still needed later in the program.

To simplify this problem, One-Shot Bufferize was designed for ops that are in
*destination-passing style*.
For every tensor result, such ops have a tensor
operand whose buffer could be used for storing the result of the op in the
absence of other conflicts. We call such tensor operands the *destination*.

As an example, consider the following op:
`%0 = tensor.insert %cst into %t[%idx] : tensor<?xf32>`

`%t` is the destination in this example. When choosing a buffer for the result
`%0`, One-Shot Bufferize considers only two options:

1.  buffer(`%0`) = buffer(`%t`).
2.  buffer(`%0`) is a newly allocated buffer.

There may be other buffers in the same function that could potentially be used
for buffer(`%0`), but those are not considered by One-Shot Bufferize to keep the
bufferization simple. One-Shot Bufferize could be extended to consider such
buffers in the future to achieve a better quality of bufferization.

Tensor ops that are not in destination-passing style always bufferize to a
memory allocation. E.g.:

```mlir
%0 = tensor.generate %sz {
^bb0(%i : index):
  %cst = arith.constant 0.0 : f32
  tensor.yield %cst : f32
} : tensor<?xf32>
```

The result of `tensor.generate` does not have a "destination", so bufferization
allocates a new buffer. This could be avoided by choosing an op such as
`linalg.generic`, which can express the same computation with a destination
("out") tensor:

```mlir
#map = affine_map<(i) -> (i)>
%0 = linalg.generic {indexing_maps = [#map], iterator_types = ["parallel"]}
                    outs(%t : tensor<?xf32>) {
  ^bb0(%arg0 : f32):
    %cst = arith.constant 0.0 : f32
    linalg.yield %cst : f32
} -> tensor<?xf32>
```

At first glance, the above `linalg.generic` op may not seem very useful because
the output tensor `%t` is entirely overwritten. Why pass the tensor `%t` as an
operand in the first place?
As an example, this can be useful for overwriting a
slice of a tensor:

```mlir
%t = tensor.extract_slice %s [%idx] [%sz] [1] : tensor<?xf32> to tensor<?xf32>
%0 = linalg.generic ... outs(%t) { ... } -> tensor<?xf32>
%1 = tensor.insert_slice %0 into %s [%idx] [%sz] [1]
    : tensor<?xf32> into tensor<?xf32>
```

The above example bufferizes to a `memref.subview`, followed by a
"`linalg.generic` on memrefs" that overwrites the memory of the subview. The
`tensor.insert_slice` bufferizes to a no-op (in the absence of RaW conflicts
such as a subsequent read of `%s`).

RaW conflicts are detected with an analysis of SSA use-def chains (details
later). One-Shot Bufferize works best if there is a single SSA use-def chain,
where the result of a tensor op is the "destination" operand of the next tensor
op, e.g.:

```mlir
%0 = "my_dialect.some_op"(%t) : (tensor<?xf32>) -> (tensor<?xf32>)
%1 = "my_dialect.another_op"(%0) : (tensor<?xf32>) -> (tensor<?xf32>)
%2 = "my_dialect.yet_another_op"(%1) : (tensor<?xf32>) -> (tensor<?xf32>)
```

Buffer copies are likely inserted if the SSA use-def chain splits at some point,
e.g.:

```mlir
%0 = "my_dialect.some_op"(%t) : (tensor<?xf32>) -> (tensor<?xf32>)
%1 = "my_dialect.another_op"(%0) : (tensor<?xf32>) -> (tensor<?xf32>)
%2 = "my_dialect.yet_another_op"(%0) : (tensor<?xf32>) -> (tensor<?xf32>)
```

One-Shot Bufferize has debug flags (`test-analysis-only print-conflicts`) that
print the results of the analysis and explain to the user why buffer copies were
inserted.

## Using One-Shot Bufferize

MLIR provides a pass
[`-one-shot-bufferize`](https://mlir.llvm.org/docs/Passes/#-one-shot-bufferize-one-shot-bufferize)
that performs an analysis and bufferizes all ops with tensor semantics that
implement `BufferizableOpInterface`.
For modularity reasons, these op interface
implementations are typically external models that live in a dialect's
"Transforms" build unit. (External models are a mechanism for implementing an op
interface in a different build unit.) It is the user's responsibility to ensure
that all needed external models are registered before running One-Shot
Bufferize.

By default, One-Shot Bufferize fails when it encounters an op with tensor
semantics (i.e., tensor result or tensor operand) that is not bufferizable
(i.e., does not implement `BufferizableOpInterface`). This can be avoided with
`allow-unknown-ops`. In that case, One-Shot Bufferize inserts
`to_memref`/`to_tensor` ops around the bufferization boundary. These ops are
named versions of `unrealized_conversion_cast`. Note that One-Shot Bufferize's
analysis currently cannot analyze these ops, so input IR with such ops may fail
bufferization. Therefore, running One-Shot Bufferize multiple times in a
sequence is also not supported at the moment.

One-Shot Bufferize can be configured to bufferize only ops from a set of
dialects with `dialect-filter`. This can be useful for gradually migrating from
dialect conversion-based bufferization to One-Shot Bufferize. One-Shot Bufferize
must run first in such a case, because dialect conversion-based bufferization
generates `to_tensor`/`to_memref` ops which One-Shot Bufferize cannot analyze.

One-Shot Bufferize can also be called programmatically with
[`bufferization::runOneShotBufferize`](https://github.com/llvm/llvm-project/blob/ae2764e835a26bad9774803eca0a6530df2a3e2d/mlir/include/mlir/Dialect/Bufferization/Transforms/OneShotAnalysis.h#L167).
Alternatively,
[`bufferization::bufferizeOp`](https://github.com/llvm/llvm-project/blob/ae2764e835a26bad9774803eca0a6530df2a3e2d/mlir/include/mlir/Dialect/Bufferization/Transforms/Bufferize.h#L78)
skips the analysis and inserts a copy on every buffer write, just like the
dialect conversion-based bufferization.

## Buffer Deallocation

One-Shot Bufferize deallocates all buffers that it allocates. This is in
contrast to the dialect conversion-based bufferization that delegates this job
to the
[`-buffer-deallocation`](https://mlir.llvm.org/docs/Passes/#-buffer-deallocation-adds-all-required-dealloc-operations-for-all-allocations-in-the-input-program)
pass. By default, One-Shot Bufferize rejects IR where a newly allocated buffer
is returned from a block. Such IR will fail bufferization.

A new buffer allocation is returned from a block when the result of an op that
is not in destination-passing style is returned. E.g.:

```mlir
%0 = scf.if %c -> (tensor<?xf32>) {
  %1 = tensor.generate ... -> tensor<?xf32>
  scf.yield %1 : tensor<?xf32>
} else {
  scf.yield %another_tensor : tensor<?xf32>
}
```

The `scf.yield` in the "else" branch is OK, but the `scf.yield` in the "then"
branch will be rejected.

Another case in which a buffer allocation may be returned is when a buffer copy
must be inserted due to a RaW conflict. E.g.:

```mlir
%0 = scf.if %c -> (tensor<?xf32>) {
  %1 = tensor.insert %cst into %another_tensor[%idx] : tensor<?xf32>
  "my_dialect.reading_tensor_op"(%another_tensor) : (tensor<?xf32>) -> ()
  ...
  scf.yield %1 : tensor<?xf32>
} else {
  scf.yield %yet_another_tensor : tensor<?xf32>
}
```

In the above example, a buffer copy of buffer(`%another_tensor`) (with `%cst`
inserted) is yielded from the "then" branch.

In both examples, a buffer is allocated inside of a block and then yielded from
the block.
Deallocation of such buffers is tricky and not currently implemented
in an efficient way. For this reason, One-Shot Bufferize must be explicitly
configured with `allow-return-allocs` to support such IR.

When running with `allow-return-allocs`, One-Shot Bufferize may introduce
allocations that cannot be deallocated by One-Shot Bufferize yet. For that
reason, `-buffer-deallocation` must be run after One-Shot Bufferize. This buffer
deallocation pass resolves yields of newly allocated buffers with copies. E.g.,
the `scf.if` example above would bufferize to IR similar to the following:

```mlir
%0 = scf.if %c -> (memref<?xf32>) {
  %1 = memref.alloc(...) : memref<?xf32>
  ...
  scf.yield %1 : memref<?xf32>
} else {
  %2 = memref.alloc(...) : memref<?xf32>
  memref.copy %another_memref, %2
  scf.yield %2 : memref<?xf32>
}
```

In the bufferized IR, both branches return a newly allocated buffer, so it does
not matter which if-branch was taken. In both cases, the resulting buffer `%0`
must be deallocated at some point after the `scf.if` (unless the `%0` is
returned/yielded from its block).

Note: Buffer allocations that are returned from a function are not deallocated,
not even with `-buffer-deallocation`. It is the caller's responsibility to
deallocate the buffer. In the future, this could be automated with allocation
hoisting (across function boundaries) or reference counting.

One-Shot Bufferize can be configured to leak all memory and not generate any
buffer deallocations with `create-deallocs=0`. This can be useful for
compatibility with legacy code that has its own method of deallocating buffers.

## Memory Layouts

One-Shot Bufferize bufferizes ops from top to bottom. This works well when all
ops are bufferizable.
However, when encountering a non-bufferizable op with
`allow-unknown-ops`, One-Shot Bufferize must insert `to_memref` ops at the
bufferization boundary and decide on a memref type. By default, One-Shot
Bufferize chooses the most dynamic memref type wrt. layout maps. E.g.:

```mlir
%0 = "my_dialect.unbufferizable_op"(%t) : (tensor<?x?xf32>) -> (tensor<?x?xf32>)
%1 = tensor.extract %0[%idx1, %idx2] : tensor<?x?xf32>
```

When bufferizing the above IR, One-Shot Bufferize inserts a `to_memref` op with
dynamic offset and strides:

```mlir
#map = affine_map<(d0, d1)[s0, s1, s2] -> (d0 * s1 + s0 + d1 * s2)>
%0 = "my_dialect.unbufferizable_op"(%t) : (tensor<?x?xf32>) -> (tensor<?x?xf32>)
%0_m = bufferization.to_memref %0 : memref<?x?xf32, #map>
%1 = memref.load %0_m[%idx1, %idx2] : memref<?x?xf32, #map>
```

All users of `%0` have fully dynamic layout maps. This ensures that the
bufferized IR composes well with future bufferizations of `unbufferizable_op`
(maybe bufferized by another pass), regardless of the exact memref type of the
future bufferization. If the op turns out to be bufferized to an op with a
simpler memref type (e.g., identity layout map), we expect that canonicalization
patterns would clean up unnecessarily dynamic layout maps. (Some of these
canonicalization patterns may not be implemented yet.)

One-Shot Bufferize tries to infer the most precise memref type when bufferizing
an op. If the entire IR is bufferizable, we do not have to resort to
conservatively using fully dynamic layout maps. In that case, we also do not
have to rely on canonicalization patterns to clean up the bufferized IR.

Note: There are some bufferizable ops for which a precise layout map cannot be
inferred. E.g., a `tensor.cast` from a `tensor<*xf32>` to a `tensor<?x?xf32>`
must be bufferized to a `memref.cast` with a memref type that has a fully
dynamic layout map.
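
The fully dynamic layout map `#map` shown earlier in this section computes the
linear position of an element as `s0 + d0 * s1 + d1 * s2`, i.e., an offset `s0`
plus each index multiplied by its stride. The following C++ sketch is
independent of MLIR and uses hypothetical helper names (`linearize`,
`identityStrides`). It only illustrates the arithmetic, and why an identity
layout can be seen as the special case with offset 0 and row-major contiguous
strides.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Linear element position for a strided layout:
//   offset + sum over k of indices[k] * strides[k]
// For rank 2 this matches d0 * s1 + s0 + d1 * s2 with s0 = offset,
// s1 = strides[0], and s2 = strides[1].
std::int64_t linearize(std::int64_t offset,
                       const std::vector<std::int64_t> &strides,
                       const std::vector<std::int64_t> &indices) {
  assert(strides.size() == indices.size() && "rank mismatch");
  std::int64_t pos = offset;
  for (std::size_t k = 0; k < indices.size(); ++k)
    pos += indices[k] * strides[k];
  return pos;
}

// Strides of the identity (row-major, contiguous) layout for a given shape.
// The innermost dimension has stride 1; each outer stride is the product of
// all inner dimension sizes.
std::vector<std::int64_t>
identityStrides(const std::vector<std::int64_t> &shape) {
  std::vector<std::int64_t> strides(shape.size(), 1);
  for (std::size_t k = shape.size(); k-- > 1;)
    strides[k - 1] = strides[k] * shape[k];
  return strides;
}
```

With a dynamic layout, offset and strides are runtime values, so every access
pays for the general formula. With an identity layout, they are known from the
shape alone, which is what makes the more precise memref type preferable when
it can be inferred.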

One-Shot Bufferize has an option `unknown-type-conversion` to control the
generation of layout maps when no precise layout can be inferred:

*   `fully-dynamic-layout-map` uses fully dynamic layout maps and is the default
    behavior. This composes well when IR is partially bufferized.
*   `identity-layout-map` uses static identity layout maps. This option can be
    useful for legacy code that cannot handle memref types with layout maps.
    Note that this setting can lead to additional buffer copies when folding a
    `to_tensor`/`to_memref` pair with memref types that are not cast-compatible.

Note: The `unknown-type-conversion` option does not affect layout maps of
function signatures. There is a separate `function-signature-type-conversion`
option that controls layout maps of function parameters and function results.

## Extending One-Shot Bufferize

Custom ops can be bufferized if they implement `BufferizableOpInterface`. Users
must at least implement the following interface methods.

*   `bufferizesToMemoryRead`: Return `true` if the buffer of the given tensor
    OpOperand is read.
*   `bufferizesToMemoryWrite`: Return `true` if the buffer of the given tensor
    OpOperand is written (if bufferizing in-place).
*   `getAliasingOpResult`: Return the OpResults that may share the same buffer
    as the given OpOperand. This interface method describes the
    OpOperand-to-OpResult mapping wrt. destination-passing style.
*   `bufferRelation`: Return `BufferRelation::Equivalent` if the given OpResult
    is the exact same memref as the aliasing OpOperand after bufferization (in
    case of in-place bufferization). Otherwise (e.g., if they overlap but are
    not necessarily the exact same memrefs), `BufferRelation::None` should be
    returned. Additional buffer relations will be added in the future, but
    `BufferRelation::None` is always safe.
*   `bufferize`: Rewrite the op with the given rewriter.
    Ops should be replaced
    with `bufferization::replaceOpWithBufferizedValues`.

To get a better intuition of the interface methods, we invite users to take a
look at existing implementations in MLIR, e.g., the implementation of
`tensor.insert` or `tensor.extract`.

## Debugging Buffer Copies

To get a better understanding of why One-Shot Bufferize introduced a buffer
copy, users can run the pass with `test-analysis-only print-conflicts`. Every
tensor op is then annotated with an attribute that has a boolean value for each
tensor OpOperand. `true` means that the OpOperand bufferizes in-place. `false`
means that the OpOperand bufferizes out-of-place and a buffer copy will be
inserted.

There are two reasons why a buffer copy may be inserted.

1.  Due to a RaW conflict, it is not safe to bufferize in-place. I.e., the
    overwritten data is still needed.
2.  The buffer is not writable. E.g., `memref.global` buffers that are the
    result of `arith.constant` ops are never modified.

In the first case, `print-conflicts` illustrates the conflict in the form of a
("read", "conflicting write", "last write") tuple.

## Understanding the SSA Use-Def Chain Analysis

To get a better understanding of the SSA Use-Def Chain Analysis and the RaW
conflict detection algorithm, we invite interested users to read the
[design document](https://discourse.llvm.org/uploads/short-url/5kckJ3DftYwQokG252teFgw3sYa.pdf)
and watch the corresponding [ODM talk](https://youtu.be/TXEo59CYS9A)
([slides](https://mlir.llvm.org/OpenMeetings/2022-01-13-One-Shot-Bufferization.pdf)).

## Migrating from Dialect Conversion-based Bufferization

Both dialect conversion-based bufferization and One-Shot Bufferize generate
`to_tensor`/`to_memref` ops at the bufferization boundary (when run with
`allow-unknown-ops`).
They can be combined and run in sequence. However,
One-Shot Bufferize must run first because it cannot analyze those boundary ops.
To update existing code step-by-step, it may be useful to specify a dialect
filter for One-Shot Bufferize, so that dialects can be switched over one-by-one.

## Bufferization Function Graphs

One-Shot Bufferize does not currently support function graph bufferization.
I.e., `CallOp`, `ReturnOp` and function bbArgs are not bufferizable. Users can
run the existing `--func-bufferize` bufferization pass after One-Shot Bufferize.

Alternatively, users can try
[`ModuleBufferization`](https://github.com/llvm/llvm-project/blob/ae2764e835a26bad9774803eca0a6530df2a3e2d/mlir/include/mlir/Dialect/Linalg/ComprehensiveBufferize/ModuleBufferization.h#L31),
which is an extension of One-Shot Bufferize. This bufferization is still under
development and does not support arbitrary IR. In essence, returning a tensor
from a function is not supported, unless it is equivalent to a function bbArg.
In that case, the corresponding return value can simply be dropped during
bufferization.

## Dialect Conversion-based Bufferization

Disclaimer: Most dialect conversion-based bufferization has been migrated to
One-Shot Bufferize. New users should use One-Shot Bufferize (with or without
analysis). The following documentation is only for existing users of dialect
conversion-based bufferization.

This system is a simple application of MLIR's dialect conversion infrastructure.
The bulk of the code related to bufferization is a set of ordinary
`ConversionPattern`s that dialect authors write for converting ops that operate
on `tensor`s to ops that operate on `memref`s.
A set of conventions and best
practices are followed that allow these patterns to be run across multiple
independent passes (rather than requiring a single huge atomic conversion pass),
which makes the compilation pipelines scalable, robust, and easy to debug.

This document is targeted at people looking to utilize MLIR's bufferization
functionality, along with people who want to extend it to cover their own ops.

<a name="the-talk">**NOTE:**</a> Before reading this document, please watch the
talk "Type Conversions the Not-So-Hard-Way: MLIR's New Bufferization
Infrastructure"
([slides](https://drive.google.com/file/d/1FVbzCXxZzS9LBLuvpPNLWJD-XDkt54ky/view?usp=sharing),
[recording](https://drive.google.com/file/d/1VfVajitgf8ZPnd-HRkJvaJiFLhBsluXN/view?usp=sharing)).
That talk gives a high-level overview of the bufferization infrastructure and
important conceptual details related to using the MLIR dialect conversion
infrastructure.

### Bufferization's place in a compilation pipeline

Bufferization itself does not free any of the buffers that have been allocated,
nor does it do anything particularly intelligent with the placement of buffers
w.r.t. control flow. Thus, a realistic compilation pipeline will usually consist
of:

1.  Bufferization
2.  Buffer optimizations such as `buffer-hoisting`, `buffer-loop-hoisting`, and
    `promote-buffers-to-stack`, which do optimizations that are only exposed
    after bufferization.
3.  Finally, running the [buffer deallocation](BufferDeallocationInternals.md)
    pass.

After buffer deallocation has been completed, the program will be quite
difficult to transform due to the presence of the deallocation ops. Thus, other
optimizations such as linalg fusion on memrefs should be done before that stage.

### General structure of the bufferization process

Bufferization consists of running multiple *partial* bufferization passes,
followed by one *finalizing* bufferization pass.

There is typically one partial bufferization pass per dialect (though other
subdivisions are possible). For example, for a dialect `X` there will typically
be a pass `X-bufferize` that knows how to bufferize all the ops in that dialect.
By running pass `X-bufferize` for each dialect `X` in the program, all the ops
in the program are incrementally bufferized.

Partial bufferization passes create programs where only some ops have been
bufferized. These passes will create *materializations* (also sometimes called
"casts") that convert between the `tensor` and `memref` type, which allows
bridging between ops that have been bufferized and ops that have not yet been
bufferized.

Finalizing bufferizations complete the bufferization process, and guarantee that
there are no tensors remaining in the program. This involves eliminating the
materializations. The pass `finalizing-bufferize` provides a minimal pass that
only eliminates materializations and issues an error if any unbufferized ops
exist in the program.

However, it is possible for a finalizing bufferization to do more than just
eliminate materializations. By adding patterns (just as a partial bufferization
would), it is possible for a finalizing bufferization pass to simultaneously
bufferize ops and eliminate materializations. This has a number of disadvantages
discussed in the talk and should generally be avoided.

### Example

As a concrete example, we will look at the bufferization pipeline from the
`mlir-npcomp` reference backend
([code](https://github.com/llvm/mlir-npcomp/blob/97d6d04d41216e73d40b89ffd79620973fc14ce3/lib/RefBackend/RefBackend.cpp#L232)).
The code, slightly simplified and annotated, is reproduced here:

```c++
  // Partial bufferization passes.
  pm.addPass(createTensorConstantBufferizePass());
  pm.addNestedPass<func::FuncOp>(createTCPBufferizePass()); // Bufferizes the downstream `tcp` dialect.
  pm.addNestedPass<func::FuncOp>(createSCFBufferizePass());
  pm.addNestedPass<func::FuncOp>(createLinalgBufferizePass());
  pm.addNestedPass<func::FuncOp>(createTensorBufferizePass());
  pm.addPass(createFuncBufferizePass());

  // Finalizing bufferization pass.
  pm.addNestedPass<func::FuncOp>(createFinalizingBufferizePass());
```

Looking first at the partial bufferization passes, we see that there is a
sequence of `FuncOp` passes (which run in parallel on functions). These function
passes are bracketed by `arith-bufferize` and `func-bufferize`, which are module
passes (and thus serialize the parallel compilation process). These two passes
must be module passes because they make changes to the top-level module.

The bulk of the bufferization work is done by the function passes. Most of these
passes are provided as part of the upstream MLIR distribution and bufferize
their respective dialects (e.g. `scf-bufferize` bufferizes the `scf` dialect).
The `tcp-bufferize` pass is an exception: it is a partial bufferization pass
used to bufferize the downstream `tcp` dialect, and fits in perfectly with all
the other passes provided upstream.

The last pass is the finalizing bufferization pass. The `mlir-npcomp` reference
backend has arranged that all ops are bufferized by partial bufferizations, so
that the upstream `finalizing-bufferize` pass can be used as the finalizing
bufferization pass. This gives excellent diagnostics when something goes wrong
with the bufferization process, such as due to an op that wasn't handled by any
pattern.

### How to write a partial bufferization pass

The contract of a partial bufferization pass is that a subset of ops (or kinds
of ops, customizable by a ConversionTarget) get bufferized.

A partial bufferization pass is just a pass that uses the
[dialect conversion](DialectConversion.md) framework to apply
`ConversionPattern`s with a `tensor` to `memref` type conversion.

To describe how to write such a pass, we will walk through an example, the
`tensor-bufferize` pass
([code](https://github.com/llvm/llvm-project/blob/bc8acf2ce8ad6e8c9b1d97b2e02d3f4ad26e1d9d/mlir/lib/Dialect/Tensor/Transforms/Bufferize.cpp#L23),
[test](https://github.com/llvm/llvm-project/blob/bc8acf2ce8ad6e8c9b1d97b2e02d3f4ad26e1d9d/mlir/test/Dialect/Tensor/bufferize.mlir#L1))
that bufferizes the `tensor` dialect. Note that these passes have been replaced
with a `BufferizableOpInterface`-based implementation in the meantime, so we
have to take a look at an older version of the code.

The bulk of the code in the pass will be a set of conversion patterns, with a
simple example being
[BufferizeCastOp](https://github.com/llvm/llvm-project/blob/2bf6e443e54604c7818c4d1a1837f3d091023270/mlir/lib/Dialect/Tensor/Transforms/Bufferize.cpp#L23).

```c++
class BufferizeCastOp : public OpConversionPattern<tensor::CastOp> {
public:
  using OpConversionPattern::OpConversionPattern;
  LogicalResult
  matchAndRewrite(tensor::CastOp op, OpAdaptor adaptor,
                  ConversionPatternRewriter &rewriter) const override {
    auto resultType = getTypeConverter()->convertType(op.getType());
    rewriter.replaceOpWithNewOp<MemRefCastOp>(op, resultType, adaptor.source());
    return success();
  }
};
```

See [the talk](#the-talk) for more details on how to write these patterns.

The
[pass itself](https://github.com/llvm/llvm-project/blob/bc8acf2ce8ad6e8c9b1d97b2e02d3f4ad26e1d9d/mlir/lib/Dialect/Tensor/Transforms/Bufferize.cpp#L57)
is very small, and follows the basic pattern of any dialect conversion pass.

```c++
void mlir::populateTensorBufferizePatterns(
    BufferizeTypeConverter &typeConverter, RewritePatternSet &patterns) {
  patterns.add<BufferizeCastOp, BufferizeExtractOp>(typeConverter,
                                                    patterns.getContext());
}

struct TensorBufferizePass : public TensorBufferizeBase<TensorBufferizePass> {
  void runOnOperation() override {
    auto *context = &getContext();
    BufferizeTypeConverter typeConverter;
    RewritePatternSet patterns(context);
    ConversionTarget target(*context);

    populateTensorBufferizePatterns(typeConverter, patterns);
    target.addIllegalOp<tensor::CastOp, tensor::ExtractOp>();
    target.addLegalDialect<func::FuncDialect>();

    if (failed(
            applyPartialConversion(getOperation(), target, std::move(patterns))))
      signalPassFailure();
  }
};
```

The pass has all the hallmarks of a dialect conversion pass that does type
conversions: a `TypeConverter`, a `RewritePatternSet`, a `ConversionTarget`,
and a call to `applyPartialConversion`. Note that a function
`populateTensorBufferizePatterns` is separated, so that power users can use the
patterns independently, if necessary (such as to combine multiple sets of
conversion patterns into a single conversion call, for performance).

One convenient utility provided by the MLIR bufferization infrastructure is the
`BufferizeTypeConverter`, which comes pre-loaded with the necessary conversions
and materializations between `tensor` and `memref`.

In the `tensor-bufferize` pass, the `bufferization::BufferizationDialect` is
marked as legal, so the `bufferization.to_tensor` and `bufferization.to_memref`
ops, which are inserted automatically by the dialect conversion framework as
materializations, are legal. The helper
`populateBufferizeMaterializationLegality`
([code](https://github.com/llvm/llvm-project/blob/a0b65a7bcd6065688189b3d678c42ed6af9603db/mlir/include/mlir/Transforms/Bufferize.h#L53))
marks these materialization ops as legal on a given `ConversionTarget`.

### Other partial bufferization examples

- `scf-bufferize`
  ([code](https://github.com/llvm/llvm-project/blob/bc8acf2ce8ad6e8c9b1d97b2e02d3f4ad26e1d9d/mlir/lib/Dialect/SCF/Transforms/Bufferize.cpp#L1),
  [test](https://github.com/llvm/llvm-project/blob/bc8acf2ce8ad6e8c9b1d97b2e02d3f4ad26e1d9d/mlir/test/Dialect/SCF/bufferize.mlir#L1))

    - Bufferizes ops from the `scf` dialect.
    - This is an example of how to bufferize ops that implement
      `RegionBranchOpInterface` (that is, they use regions to represent
      control flow).
    - The bulk of the work is done by
      `lib/Dialect/SCF/Transforms/StructuralTypeConversions.cpp`
      ([code](https://github.com/llvm/llvm-project/blob/daaaed6bb89044ac58a23f1bb1ccdd12342a5a58/mlir/lib/Dialect/SCF/Transforms/StructuralTypeConversions.cpp#L1)),
      which is well-commented and covers how to correctly convert ops that
      contain regions.

- `func-bufferize`
  ([code](https://github.com/llvm/llvm-project/blob/2f5715dc78328215d51d5664c72c632a6dac1046/mlir/lib/Dialect/Func/Transforms/FuncBufferize.cpp#L1),
  [test](https://github.com/llvm/llvm-project/blob/2f5715dc78328215d51d5664c72c632a6dac1046/mlir/test/Dialect/Func/func-bufferize.mlir#L1))

    - Bufferizes `func`, `call`, and `BranchOpInterface` ops.
    - This is an example of how to bufferize ops that have multi-block
      regions.
    - This is an example of a pass that is not split along dialect
      subdivisions.

### How to write a finalizing bufferization pass

The contract of a finalizing bufferization pass is that all tensors are gone
from the program.

The easiest way to write a finalizing bufferization pass is to not write one at
all! MLIR provides a pass `finalizing-bufferize` which eliminates the
`bufferization.to_tensor` / `bufferization.to_memref` materialization ops
inserted by partial bufferization passes and emits an error if that is not
sufficient to remove all tensors from the program.

This pass is sufficient when partial bufferization passes have bufferized all
the ops in the program, leaving behind only the materializations. When possible,
it is recommended to structure your pass pipeline this way, as this has the
significant advantage that if an op does not get bufferized (due to a missing
pattern, bug in the code, etc.), `finalizing-bufferize` will emit a clean
error, and the IR seen by `finalizing-bufferize` will contain only one
unbufferized op.

However, before the current bufferization infrastructure was put in place,
bufferization could only be done as a single finalizing bufferization mega-pass
that used the `populate*BufferizePatterns` functions from multiple dialects to
bufferize everything at once. Thus, one might see code in downstream projects
structured this way. This structure is not recommended in new code. A helper,
`populateEliminateBufferizeMaterializationsPatterns`
([code](https://github.com/llvm/llvm-project/blob/a0b65a7bcd6065688189b3d678c42ed6af9603db/mlir/include/mlir/Transforms/Bufferize.h#L58))
is available for such passes to provide patterns that eliminate
`bufferization.to_tensor` and `bufferization.to_memref`.
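
The finalizing contract can be illustrated with a small, self-contained toy
model. This is a sketch under stated assumptions, not how `finalizing-bufferize`
is implemented: `ToyOp` and `toyFinalizingBufferize` are hypothetical names,
each op has at most one operand and one result, and ops appear in definition
order. The sketch folds every `to_memref` whose operand comes from a `to_tensor`,
drops `to_tensor` ops that become unused, and reports failure if any remaining
op still produces a tensor.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy stand-in for an operation with at most one operand and one result.
struct ToyOp {
  std::string name;
  int operand;         // id of the consumed value, -1 if none
  int result;          // id of the produced value, -1 if none
  bool producesTensor; // true if the result is tensor-typed
};

// Returns true and updates `ops` if no tensor-producing op remains after
// folding materializations; returns false and leaves `ops` untouched otherwise.
bool toyFinalizingBufferize(std::vector<ToyOp> &ops) {
  std::map<int, int> memrefBehindTensor; // to_tensor result -> its memref
  std::map<int, int> replacement;        // folded to_memref result -> memref
  std::vector<ToyOp> kept;
  for (ToyOp op : ops) {
    // Redirect uses of a folded to_memref to the original memref.
    auto replaced = replacement.find(op.operand);
    if (replaced != replacement.end())
      op.operand = replaced->second;
    if (op.name == "to_tensor")
      memrefBehindTensor[op.result] = op.operand;
    if (op.name == "to_memref") {
      auto source = memrefBehindTensor.find(op.operand);
      if (source != memrefBehindTensor.end()) {
        // to_memref(to_tensor(m)) folds to m; drop the to_memref.
        replacement[op.result] = source->second;
        continue;
      }
    }
    kept.push_back(op);
  }
  // Drop to_tensor ops whose results no longer have any use.
  auto isUsed = [&kept](int value) {
    return std::any_of(kept.begin(), kept.end(),
                       [value](const ToyOp &op) { return op.operand == value; });
  };
  std::vector<ToyOp> live;
  for (const ToyOp &op : kept)
    if (op.name != "to_tensor" || isUsed(op.result))
      live.push_back(op);
  // The finalizing contract: no tensors may remain.
  for (const ToyOp &op : live)
    if (op.producesTensor)
      return false;
  ops = live;
  return true;
}
```

When every op between the materializations has been bufferized, only the
memref ops survive. If an unbufferized op is left between a `to_tensor` and a
`to_memref`, the pair cannot be folded and the sketch reports failure, which
corresponds to the error case described above.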

### Changes since [the talk](#the-talk)

- `func-bufferize` was changed to be a partial conversion pass, and there is a
  new `finalizing-bufferize` which serves as a general finalizing
  bufferization pass.
- Most partial bufferization passes have been reimplemented in terms of
  `BufferizableOpInterface`. New users should use One-Shot Bufferize instead
  of dialect conversion-based bufferization.
