1# 'vector' Dialect
2
3[TOC]
4
5MLIR supports multi-dimensional `vector` types and custom operations on those
6types. A generic, retargetable, higher-order `vector` type (`n-D` with `n > 1`)
7is a structured type, that carries semantic information useful for
8transformations. This document discusses retargetable abstractions that exist in
9MLIR today and operate on ssa-values of type `vector` along with pattern
10rewrites and lowerings that enable targeting specific instructions on concrete
11targets. These abstractions serve to separate concerns between operations on
12`memref` (a.k.a buffers) and operations on `vector` values. This is not a new
13proposal but rather a textual documentation of existing MLIR components along
14with a rationale.
15
16## Positioning in the Codegen Infrastructure
17
18The following diagram, recently presented with the
19[StructuredOps abstractions](https://drive.google.com/corp/drive/u/0/folders/1sRAsgsd8Bvpm_IxREmZf2agsGU2KvrK-),
20captures the current codegen paths implemented in MLIR in the various existing
21lowering paths.
22![](https://user-images.githubusercontent.com/10148468/71177417-f78e4d80-2239-11ea-92ef-700f42ea503f.png)
23
24The following diagram seeks to isolate `vector` dialects from the complexity of
25the codegen paths and focus on the payload-carrying ops that operate on std and
26`vector` types. This diagram is not to be taken as set in stone and
27representative of what exists today but rather illustrates the layering of
28abstractions in MLIR.
29
30![`vector` Abstractions in MLIR](https://user-images.githubusercontent.com/10148468/71176949-e85ad000-2238-11ea-9806-200843bc4943.png)
31
32This  separates concerns related to (a) defining efficient operations on
33`vector` types from (b) program analyses + transformations on `memref`, loops
34and other types of structured ops (be they `HLO`, `LHLO`, `Linalg` or other ).
35Looking a bit forward in time, we can put a stake in the ground and venture that
36the higher level of `vector`-level primitives we build and target from codegen
37(or some user/language level), the simpler our task will be, the more complex
38patterns can be expressed and the better performance will be.
39
40## Components of a Generic Retargetable Vector-Level Dialect
41
42The existing MLIR `vector`-level dialects are related to the following bottom-up
43abstractions:
44
451.  Representation in `LLVMIR` via data structures, instructions and intrinsics.
46    This is referred to as the `LLVM` level.
472.  Set of machine-specific operations and types that are built to translate
48    almost 1-1 with the HW ISA. This is referred to as the Hardware Vector
49    level; a.k.a `HWV`. For instance, we have (a) the `NVVM` dialect (for
50    `CUDA`) with tensor core ops, (b) accelerator-specific dialects (internal),
51    a potential (future) `CPU` dialect to capture `LLVM` intrinsics more closely
52    and other dialects for specific hardware. Ideally this should be
53    auto-generated as much as possible from the `LLVM` level.
543.  Set of virtual, machine-agnostic, operations that are informed by costs at
55    the `HWV`-level. This is referred to as the Virtual Vector level; a.k.a
56    `VV`. This is the level that higher-level abstractions (codegen, automatic
57    vectorization, potential vector language, ...) targets.
58
59The existing generic, retargetable, `vector`-level dialect is related to the
60following top-down rewrites and conversions:
61
621.  MLIR Rewrite Patterns applied by the MLIR `PatternRewrite` infrastructure to
63    progressively lower to implementations that match closer and closer to the
64    `HWV`. Some patterns are "in-dialect" `VV -> VV` and some are conversions
65    `VV -> HWV`.
662.  `Virtual Vector -> Hardware Vector` lowering is specified as a set of MLIR
67    lowering patterns that are specified manually for now.
683.  `Hardware Vector -> LLVM` lowering is a mechanical process that is written
69    manually at the moment and that should be automated, following the `LLVM ->
70    Hardware Vector` ops generation as closely as possible.
71
72## Short Description of the Existing Infrastructure
73
74### LLVM level
75
76On CPU, the `n-D` `vector` type currently lowers to `!llvm<array<vector>>`. More
77concretely, `vector<4x8x128xf32>` lowers to `!llvm<[4 x [ 8 x [ 128 x float
78]]]>`. There are tradeoffs involved related to how one can access subvectors and
79how one uses `llvm.extractelement`, `llvm.insertelement` and
80`llvm.shufflevector`. A [deeper dive section](#DeeperDive) discusses the current
81lowering choices and tradeoffs.
82
83### Hardware Vector Ops
84
85Hardware Vector Ops are implemented as one dialect per target. For internal
86hardware, we are auto-generating the specific HW dialects. For `GPU`, the `NVVM`
87dialect adds operations such as `mma.sync`, `shfl` and tests. For `CPU` things
88are somewhat in-flight because the abstraction is close to `LLVMIR`. The jury is
89still out on  whether a generic `CPU` dialect is concretely needed, but it seems
90reasonable to have the same levels of abstraction for all targets and perform
91cost-based lowering decisions in MLIR even for `LLVM`. Specialized `CPU`
92dialects that would capture specific features not well captured by LLVM peephole
93optimizations of on different types that core MLIR supports (e.g. Scalable
94Vectors) are welcome future extensions.
95
96### Virtual Vector Ops
97
98Some existing Arithmetic and Vector Dialect on `n-D` `vector` types comprise:
99
100```mlir
101// Produces a vector<3x7x8xf32>
102%a = arith.addf %0, %1 : vector<3x7x8xf32>
103// Produces a vector<3x7x8xf32>
104%b = arith.mulf %0, %1 : vector<3x7x8xf32>
105// Produces a vector<3x7x8xf32>
106%c = vector.splat %1 : vector<3x7x8xf32>
107
108%d = vector.extract %0[1]: vector<3x7x8xf32>     // -> vector<7x8xf32>
109%e = vector.extract %0[1, 5]: vector<3x7x8xf32>  // -> vector<8xf32>
110%f = vector.outerproduct %0, %1: vector<4xf32>, vector<8xf32>      // -> vector<4x8xf32>
111%g = vector.outerproduct %0, %1, %2: vector<4xf32>, vector<8xf32>  // fma when adding %2
112
113// Returns a slice of type vector<2x2x16xf32>
114%h = vector.strided_slice %0
115    {offsets = [2, 2], sizes = [2, 2], strides = [1, 1]}:
116  vector<4x8x16xf32>
117
118%i = vector.transfer_read %A[%0, %1]
119    {permutation_map = (d0, d1) -> (d0)}:
120  memref<7x?xf32>, vector<4xf32>
121
122vector.transfer_write %f1, %A[%i0, %i1, %i2, %i3]
123    {permutation_map = (d0, d1, d2, d3) -> (d3, d1, d0)} :
124  vector<5x4x3xf32>, memref<?x?x?x?xf32>
125```
126
127The list of Vector is currently undergoing evolutions and is best kept track of
128by following the evolution of the
129[VectorOps.td](https://github.com/llvm/llvm-project/blob/main/mlir/include/mlir/Dialect/Vector/IR/VectorOps.td)
130ODS file (markdown documentation is automatically generated locally when
131building and populates the
132[Vector doc](https://github.com/llvm/llvm-project/blob/main/mlir/docs/Dialects/Vector.md)).
133Recent extensions are driven by concrete use cases of interest. A notable such
134use case is the `vector.contract` op which applies principles of the
135StructuredOps abstraction to `vector` types.
136
137### Virtual Vector Rewrite Patterns
138
139The following rewrite patterns exist at the `VV->VV` level:
140
1411.  The now retired `MaterializeVector` pass used to legalize ops on a
142    coarse-grained virtual `vector` to a finer-grained virtual `vector` by
143    unrolling. This has been rewritten as a retargetable unroll-and-jam pattern
144    on `vector` ops and `vector` types.
1452.  The lowering of `vector_transfer` ops legalizes `vector` load/store ops to
146    permuted loops over scalar load/stores. This should evolve to loops over
147    `vector` load/stores + `mask` operations as they become available `vector`
148    ops at the `VV` level.
149
150The general direction is to add more Virtual Vector level ops and implement more
151useful `VV -> VV` rewrites as composable patterns that the PatternRewrite
152infrastructure can apply iteratively.
153
154### Virtual Vector to Hardware Vector Lowering
155
156For now, `VV -> HWV` are specified in C++ (see for instance the
157[SplatOpLowering for n-D vectors](https://github.com/tensorflow/mlir/commit/0a0c4867c6a6fcb0a2f17ef26a791c1d551fe33d)
158or the
159[VectorOuterProductOp lowering](https://github.com/tensorflow/mlir/commit/957b1ca9680b4aacabb3a480fbc4ebd2506334b8)).
160
161Simple
162[conversion tests](https://github.com/llvm/llvm-project/blob/main/mlir/test/Conversion/VectorToLLVM/vector-to-llvm.mlir)
163are available for the `LLVM` target starting from the Virtual Vector Level.
164
165## Rationale
166
167### Hardware as `vector` Machines of Minimum Granularity
168
169Higher-dimensional `vector`s are ubiquitous in modern HPC hardware. One way to
170think about Generic Retargetable `vector`-Level Dialect is that it operates on
171`vector` types that are multiples of a "good" `vector` size so the HW can
172efficiently implement a set of high-level primitives (e.g.
173`vector<8x8x8x16xf32>` when HW `vector` size is say `vector<4x8xf32>`).
174
175Some notable `vector` sizes of interest include:
176
1771.  CPU: `vector<HW_vector_size * k>`, `vector<core_count * k’ x
178    HW_vector_size * k>` and `vector<socket_count x core_count * k’ x
179    HW_vector_size * k>`
1802.  GPU: `vector<warp_size * k>`, `vector<warp_size * k x float4>` and
181    `vector<warp_size * k x 4 x 4 x 4>` for tensor_core sizes,
1823.  Other accelerators: n-D `vector` as first-class citizens in the HW.
183
184Depending on the target, ops on sizes that are not multiples of the HW `vector`
185size may either produce slow code (e.g. by going through `LLVM` legalization) or
186may not legalize at all (e.g. some unsupported accelerator X combination of ops
187and types).
188
189### Transformations Problems Avoided
190
191A `vector<16x32x64xf32>` virtual `vector` is a coarse-grained type that can be
192“unrolled” to HW-specific sizes. The multi-dimensional unrolling factors are
193carried in the IR by the `vector` type. After unrolling, traditional
194instruction-level scheduling can be run.
195
196The following key transformations (along with the supporting analyses and
197structural constraints) are completely avoided by operating on a `vector`
198`ssa-value` abstraction:
199
2001.  Loop unroll and unroll-and-jam.
2012.  Loop and load-store restructuring for register reuse.
2023.  Load to store forwarding and Mem2reg.
2034.  Coarsening (raising) from finer-grained `vector` form.
204
205Note that “unrolling” in the context of `vector`s corresponds to partial loop
206unroll-and-jam and not full unrolling. As a consequence this is expected to
207compose with SW pipelining where applicable and does not result in ICache blow
208up.
209
210### The Big Out-Of-Scope Piece: Automatic Vectorization
211
212One important piece not discussed here is automatic vectorization (automatically
213raising from scalar to n-D `vector` ops and types). The TL;DR is that when the
214first "super-vectorization" prototype was implemented, MLIR was nowhere near as
215mature as it is today. As we continue building more abstractions in `VV -> HWV`,
216there is an opportunity to revisit vectorization in MLIR.
217
218Since this topic touches on codegen abstractions, it is technically out of the
219scope of this survey document but there is a lot to discuss in light of
220structured op type representations and how a vectorization transformation can be
221reused across dialects. In particular, MLIR allows the definition of dialects at
222arbitrary levels of granularity and lends itself favorably to progressive
223lowering. The argument can be made that automatic vectorization on a loops + ops
224abstraction is akin to raising structural information that has been lost.
225Instead, it is possible to revisit vectorization as simple pattern rewrites,
226provided the IR is in a suitable form. For instance, vectorizing a
227`linalg.generic` op whose semantics match a `matmul` can be done
228[quite easily with a pattern](https://github.com/tensorflow/mlir/commit/bff722d6b59ab99b998f0c2b9fccd0267d9f93b5).
229In fact this pattern is trivial to generalize to any type of contraction when
230targeting the `vector.contract` op, as well as to any field (`+/*`, `min/+`,
231`max/+`, `or/and`, `logsumexp/+` ...) . In other words, by operating on a higher
232level of generic abstractions than affine loops, non-trivial transformations
233become significantly simpler and composable at a finer granularity.
234
235Irrespective of the existence of an auto-vectorizer, one can build a notional
236vector language based on the VectorOps dialect and build end-to-end models with
237expressing `vector`s in the IR directly and simple pattern-rewrites.
238[EDSC](https://github.com/llvm/llvm-project/blob/main/mlir/docs/EDSC.md)s
239provide a simple way of driving such a notional language directly in C++.
240
241## Bikeshed Naming Discussion
242
243There are arguments against naming an n-D level of abstraction `vector` because
244most people associate it with 1-D `vector`s. On the other hand, `vector`s are
245first-class n-D values in MLIR. The alternative name Tile has been proposed,
246which conveys higher-D meaning. But it also is one of the most overloaded terms
247in compilers and hardware. For now, we generally use the `n-D` `vector` name and
248are open to better suggestions.
249
250## DeeperDive
251
252This section describes the tradeoffs involved in lowering the MLIR n-D vector
253type and operations on it to LLVM-IR. Putting aside the
254[LLVM Matrix](http://lists.llvm.org/pipermail/llvm-dev/2018-October/126871.html)
255proposal for now, this assumes LLVM only has built-in support for 1-D vector.
256The relationship with the LLVM Matrix proposal is discussed at the end of this
257document.
258
259MLIR does not currently support dynamic vector sizes (i.e. SVE style) so the
260discussion is limited to static rank and static vector sizes (e.g.
261`vector<4x8x16x32xf32>`). This section discusses operations on vectors in LLVM
262and MLIR.
263
264LLVM instructions are prefixed by the `llvm.` dialect prefix (e.g.
265`llvm.insertvalue`). Such ops operate exclusively on 1-D vectors and aggregates
266following the [LLVM LangRef](https://llvm.org/docs/LangRef.html). MLIR
267operations are prefixed by the `vector.` dialect prefix (e.g.
268`vector.insertelement`). Such ops operate exclusively on MLIR `n-D` `vector`
269types.
270
271### Alternatives For Lowering an n-D Vector Type to LLVM
272
273Consider a vector of rank n with static sizes `{s_0, ... s_{n-1}}` (i.e. an MLIR
274`vector<s_0x...s_{n-1}xf32>`). Lowering such an `n-D` MLIR vector type to an
275LLVM descriptor can be done by either:
276
2771.  Flattening to a `1-D` vector: `!llvm<"(s_0*...*s_{n-1})xfloat">` in the MLIR
278    LLVM dialect.
2792.  Nested aggregate type of `1-D` vector:
280    `!llvm."[s_0x[s_1x[...<s_{n-1}xf32>]]]">` in the MLIR LLVM dialect.
2813.  A mix of both.
282
283There are multiple tradeoffs involved in choosing one or the other that we
284discuss. It is important to note that “a mix of both” immediately reduces to
285“nested aggregate type of 1-D vector” with a `vector.cast %0:
286vector<4x8x16x32xf32> to vector<4x4096xf32>` operation, that flattens the most
287"k" minor dimensions.
288
289### Constraints Inherited from LLVM (see LangRef)
290
291The first constraint was already mentioned: LLVM only supports `1-D` `vector`
292types natively. Additional constraints are related to the difference in LLVM
293between vector and aggregate types: `“Aggregate Types are a subset of derived
294types that can contain multiple member types. Arrays and structs are aggregate
295types. Vectors are not considered to be aggregate types.”.`
296
297This distinction is also reflected in some of the operations. For `1-D` vectors,
298the operations `llvm.extractelement`, `llvm.insertelement`, and
299`llvm.shufflevector` apply, with direct support for dynamic indices. For `n-D`
300vectors with `n>1`, and thus aggregate types at LLVM level, the more restrictive
301operations `llvm.extractvalue` and `llvm.insertvalue` apply, which only accept
302static indices. There is no direct shuffling support for aggregate types.
303
304The next sentence illustrates a recurrent tradeoff, also found in MLIR, between
305“value types” (subject to SSA use-def chains) and “memory types” (subject to
306aliasing and side-effects): `“Structures in memory are accessed using ‘load’ and
307‘store’ by getting a pointer to a field with the llvm.getelementptr instruction.
308Structures in registers are accessed using the llvm.extractvalue and
309llvm.insertvalue instructions.”`
310
311When transposing this to MLIR, `llvm.getelementptr` works on pointers to `n-D`
312vectors in memory. For `n-D`, vectors values that live in registers we can use
313`vector.extract` and `vector.insert` which do not accept dynamic indices. Note
314that this is consistent with hardware considerations as discussed below.
315
316An alternative is to use an LLVM `1-D` `vector` type for which one can use
317`llvm.extractelement`, `llvm.insertelement` and `llvm.shufflevector`. These
318operations accept dynamic indices. The implication is that one has to use a
319flattened lowering of an MLIR n-D vector to an LLVM 1-D vector.
320
321There are multiple tradeoffs involved that mix implications on the programming
322model, execution on actual HW and what is visible or hidden from codegen. They
323are discussed in the following sections.
324
325### Nested Aggregate
326
327Pros:
328
3291.  Natural encoding n-D vector -> (n-1)-D aggregate over 1-D vector.
3302.  No need for linearization / delinearization logic inserted everywhere.
3313.  `llvm.insertvalue`, `llvm.extractvalue` of `(n-k)-D` aggregate is natural.
3324.  `llvm.insertelement`, `llvm.extractelement`, `llvm.shufflevector` over `1-D`
333    vector type is natural.
334
335Cons:
336
3371.  `llvm.insertvalue` / `llvm.extractvalue` does not accept dynamic indices but
338    only static ones.
3392.  Dynamic indexing on the non-most-minor dimension requires roundtrips to
340    memory.
3413.  Special intrinsics and native instructions in LLVM operate on `1-D` vectors.
342    This is not expected to be a practical limitation thanks to a `vector.cast
343    %0: vector<4x8x16x32xf32> to vector<4x4096xf32>` operation, that flattens
344    the most minor dimensions (see the bigger picture in implications on
345    codegen).
346
347### Flattened 1-D Vector Type
348
349Pros:
350
3511.  `insertelement` / `extractelement` / `shufflevector` with dynamic indexing
352    is possible over the whole lowered `n-D` vector type.
3532.  Supports special intrinsics and native operations.
354
355Cons:
356
3571.  Requires linearization/delinearization logic everywhere, translations are
358    complex.
3592.  Hides away the real HW structure behind dynamic indexing: at the end of the
360    day, HW vector sizes are generally fixed and multiple vectors will be needed
361    to hold a vector that is larger than the HW.
3623.  Unlikely peephole optimizations will result in good code: arbitrary dynamic
363    accesses, especially at HW vector boundaries unlikely to result in regular
364    patterns.
365
366### Discussion
367
368#### HW Vectors and Implications on the SW and the Programming Model
369
370As of today, the LLVM model only support `1-D` vector types. This is
371unsurprising because historically, the vast majority of HW only supports `1-D`
372vector registers. We note that multiple HW vendors are in the process of
373evolving to higher-dimensional physical vectors.
374
375In the following discussion, let's assume the HW vector size is `1-D` and the SW
376vector size is `n-D`, with `n >= 1`. The same discussion would apply with `2-D`
377HW `vector` size and `n >= 2`. In this context, most HW exhibit a vector
378register file. The number of such vectors is fixed. Depending on the rank and
379sizes of the SW vector abstraction and the HW vector sizes and number of
380registers, an `n-D` SW vector type may be materialized by a mix of multiple
381`1-D` HW vector registers + memory locations at a given point in time.
382
383The implication of the physical HW constraints on the programming model are that
384one cannot index dynamically across hardware registers: a register file can
385generally not be indexed dynamically. This is because the register number is
386fixed and one either needs to unroll explicitly to obtain fixed register numbers
387or go through memory. This is a constraint familiar to CUDA programmers: when
388declaring a `private float a[4]`; and subsequently indexing with a *dynamic*
389value results in so-called **local memory** usage (i.e. roundtripping to
390memory).
391
392#### Implication on codegen
393
394MLIR `n-D` vector types are currently represented as `(n-1)-D` arrays of `1-D`
395vectors when lowered to LLVM. This introduces the consequences on static vs
396dynamic indexing discussed previously: `extractelement`, `insertelement` and
397`shufflevector` on `n-D` vectors in MLIR only support static indices. Dynamic
398indices are only supported on the most minor `1-D` vector but not the outer
399`(n-1)-D`. For other cases, explicit load / stores are required.
400
401The implications on codegen are as follows:
402
4031.  Loops around `vector` values are indirect addressing of vector values, they
404    must operate on explicit load / store operations over `n-D` vector types.
4052.  Once an `n-D` `vector` type is loaded into an SSA value (that may or may not
406    live in `n` registers, with or without spilling, when eventually lowered),
407    it may be unrolled to smaller `k-D` `vector` types and operations that
408    correspond to the HW. This level of MLIR codegen is related to register
409    allocation and spilling that occur much later in the LLVM pipeline.
4103.  HW may support >1-D vectors with intrinsics for indirect addressing within
411    these vectors. These can be targeted thanks to explicit `vector_cast`
412    operations from MLIR `k-D` vector types and operations to LLVM `1-D`
413    vectors + intrinsics.
414
415Alternatively, we argue that directly lowering to a linearized abstraction hides
416away the codegen complexities related to memory accesses by giving a false
417impression of magical dynamic indexing across registers. Instead we prefer to
418make those very explicit in MLIR and allow codegen to explore tradeoffs.
419Different HW will require different tradeoffs in the sizes involved in steps 1.,
4202. and 3.
421
422Decisions made at the MLIR level will have implications at a much later stage in
423LLVM (after register allocation). We do not envision to expose concerns related
424to modeling of register allocation and spilling to MLIR explicitly. Instead,
425each target will expose a set of "good" target operations and `n-D` vector
426types, associated with costs that `PatterRewriters` at the MLIR level will be
427able to target. Such costs at the MLIR level will be abstract and used for
428ranking, not for accurate performance modeling. In the future such costs will be
429learned.
430
431#### Implication on Lowering to Accelerators
432
433To target accelerators that support higher dimensional vectors natively, we can
434start from either `1-D` or `n-D` vectors in MLIR and use `vector.cast` to
435flatten the most minor dimensions to `1-D` `vector<Kxf32>` where `K` is an
436appropriate constant. Then, the existing lowering to LLVM-IR immediately
437applies, with extensions for accelerator-specific intrinsics.
438
439It is the role of an Accelerator-specific vector dialect (see codegen flow in
440the figure above) to lower the `vector.cast`. Accelerator -> LLVM lowering would
441then consist of a bunch of `Accelerator -> Accelerator` rewrites to perform the
442casts composed with `Accelerator -> LLVM` conversions + intrinsics that operate
443on `1-D` `vector<Kxf32>`.
444
445Some of those rewrites may need extra handling, especially if a reduction is
446involved. For example, `vector.cast %0: vector<K1x...xKnxf32> to vector<Kxf32>`
447when `K != K1 * … * Kn` and some arbitrary irregular `vector.cast %0:
448vector<4x4x17xf32> to vector<Kxf32>` may introduce masking and intra-vector
449shuffling that may not be worthwhile or even feasible, i.e. infinite cost.
450
451However `vector.cast %0: vector<K1x...xKnxf32> to vector<Kxf32>` when `K = K1 *
452… * Kn` should be close to a noop.
453
454As we start building accelerator-specific abstractions, we hope to achieve
455retargetable codegen: the same infra is used for CPU, GPU and accelerators with
456extra MLIR patterns and costs.
457
458#### Implication on calling external functions that operate on vectors
459
460It is possible (likely) that we additionally need to linearize when calling an
461external function.
462
463### Relationship to LLVM matrix type proposal.
464
465The LLVM matrix proposal was formulated 1 year ago but seemed to be somewhat
466stalled until recently. In its current form, it is limited to 2-D matrix types
467and operations are implemented with LLVM intrinsics. In contrast, MLIR sits at a
468higher level of abstraction and allows the lowering of generic operations on
469generic n-D vector types from MLIR to aggregates of 1-D LLVM vectors. In the
470future, it could make sense to lower to the LLVM matrix abstraction also for CPU
471even though MLIR will continue needing higher level abstractions.
472
473On the other hand, one should note that as MLIR is moving to LLVM, this document
474could become the unifying abstraction that people should target for 1-D vectors
475and the LLVM matrix proposal can be viewed as a subset of this work.
476
477### Conclusion
478
479The flattened 1-D vector design in the LLVM matrix proposal is good in a
480HW-specific world with special intrinsics. This is a good abstraction for
481register allocation, Instruction-Level-Parallelism and
482SoftWare-Pipelining/Modulo Scheduling optimizations at the register level.
483However MLIR codegen operates at a higher level of abstraction where we want to
484target operations on coarser-grained vectors than the HW size and on which
485unroll-and-jam is applied and patterns across multiple HW vectors can be
486matched.
487
488This makes “nested aggregate type of 1-D vector” an appealing abstraction for
489lowering from MLIR because:
490
4911.  it does not hide complexity related to the buffer vs value semantics and the
492    memory subsystem and
4932.  it does not rely on LLVM to magically make all the things work from a too
494    low-level abstraction.
495
496The use of special intrinsics in a `1-D` LLVM world is still available thanks to
497an explicit `vector.cast` op.
498
499## Operations
500
501[include "Dialects/VectorOps.md"]
502