1This note attempts to describe the motivation for and design of an
2implementation of Fortran 90 (and later) array expression evaluation that
3minimizes the use of dynamically allocated temporary storage for
4the results of calls to transformational intrinsic functions, and
5making them more amenable to acceleration.
6
7The transformational intrinsic functions of Fortran of interest to
8us here include:
9
10* Reductions to scalars (`SUM(X)`, also `ALL`, `ANY`, `COUNT`,
11  `DOT_PRODUCT`,
12  `IALL`, `IANY`, `IPARITY`, `MAXVAL`, `MINVAL`, `PARITY`, `PRODUCT`)
13* Axial reductions (`SUM(X,DIM=)`, &c.)
14* Location reductions to indices (`MAXLOC`, `MINLOC`, `FINDLOC`)
15* Axial location reductions (`MAXLOC(DIM=`, &c.)
16* `TRANSPOSE(M)` matrix transposition
17* `RESHAPE` without `ORDER=`
18* `RESHAPE` with `ORDER=`
19* `CSHIFT` and `EOSHIFT` with scalar `SHIFT=`
20* `CSHIFT` and `EOSHIFT` with array-valued `SHIFT=`
21* `PACK` and `UNPACK`
22* `MATMUL`
23* `SPREAD`
24
25Other Fortran intrinsic functions are technically transformational (e.g.,
26`COMMAND_ARGUMENT_COUNT`) but not of interest for this note.
27The generic `REDUCE` is also not considered here.
28
29Arrays as functions
30===================
31A whole array can be viewed as a function that maps its indices to the values
32of its elements.
33Specifically, it is a map from a tuple of integers to its element type.
34The rank of the array is the number of elements in that tuple,
35and the shape of the array delimits the domain of the map.
36
37`REAL :: A(N,M)` can be seen as a function mapping ordered pairs of integers
38`(J,K)` with `1<=J<=N` and `1<=J<=M` to real values.
39
40Array expressions as functions
41==============================
42The same perspective can be taken of an array expression comprising
43intrinsic operators and elemental functions.
44Fortran doesn't allow one to apply subscripts directly to an expression,
45but expressions have rank and shape, and one can view array expressions
46as functions over index tuples by applying those indices to the arrays
47and subexpressions in the expression.
48
49Consider `B = A + 1.0` (assuming `REAL :: A(N,M), B(N,M)`).
50The right-hand side of that assignment could be evaluated into a
51temporary array `T` and then subscripted as it is copied into `B`.
52```
53REAL, ALLOCATABLE :: T(:,:)
54ALLOCATE(T(N,M))
55DO CONCURRENT(J=1:N,K=1:M)
56  T(J,K)=A(J,K) + 1.0
57END DO
58DO CONCURRENT(J=1:N,K=1:M)
59  B(J,K)=T(J,K)
60END DO
61DEALLOCATE(T)
62```
63But we can avoid the allocation, population, and deallocation of
64the temporary by treating the right-hand side expression as if it
65were a statement function `F(J,K)=A(J,K)+1.0` and evaluating
66```
67DO CONCURRENT(J=1:N,K=1:M)
68  A(J,K)=F(J,K)
69END DO
70```
71
72In general, when a Fortran array assignment to a non-allocatable array
73does not include the left-hand
74side variable as an operand of the right-hand side expression, and any
75function calls on the right-hand side are elemental or scalar-valued,
76we can avoid the use of a temporary.
77
78Transformational intrinsic functions as function composition
79============================================================
80Many of the transformational intrinsic functions listed above
81can, when their array arguments are viewed as functions over their
82index tuples, be seen as compositions of those functions with
83functions of the "incoming" indices -- yielding a function for
84an entire right-hand side of an array assignment statement.
85
86For example, the application of `TRANSPOSE(A + 1.0)` to the index
87tuple `(J,K)` becomes `A(K,J) + 1.0`.
88
89Partial (axial) reductions can be similarly composed.
90The application of `SUM(A,DIM=2)` to the index `J` is the
91complete reduction `SUM(A(J,:))`.
92
93More completely:
94* Reductions to scalars (`SUM(X)` without `DIM=`) become
95  runtime calls; the result needs no dynamic allocation,
96  being a scalar.
97* Axial reductions (`SUM(X,DIM=d)`) applied to indices `(J,K)`
98  become scalar values like `SUM(X(J,K,:))` if `d=3`.
99* Location reductions to indices (`MAXLOC(X)` without `DIM=`)
100  do not require dynamic allocation, since their results are
101  either scalar or small vectors of length `RANK(X)`.
102* Axial location reductions (`MAXLOC(X,DIM=)`, &c.)
103  are handled like other axial reductions like `SUM(DIM=)`.
104* `TRANSPOSE(M)` exchanges the two components of the index tuple.
105* `RESHAPE(A,SHAPE=s)` without `ORDER=` must precompute the shape
106  vector `S`, and then use it to linearize indices into offsets
107  in the storage order of `A` (whose shape must also be captured).
108  These conversions can involve division and/or modulus, which
109  can be optimized into a fixed-point multiplication using the
110  usual technique.
111* `RESHAPE` with `ORDER=` is similar, but must permute the
112  components of the index tuple; it generalizes `TRANSPOSE`.
113* `CSHIFT` applies addition and modulus.
114* `EOSHIFT` applies addition and a conditional move (`MERGE`).
115* `PACK` and `UNPACK` are likely to require a runtime call.
116* `MATMUL(A,B)` can become `DOT_PRODUCT(A(J,:),B(:,K))`, but
117  might benefit from calling a highly optimized runtime
118  routine.
119* `SPREAD(A,DIM=d,NCOPIES=n)` for compile-time `d` simply
120  applies `A` to a reduced index tuple.
121
122Determination of rank and shape
123===============================
124An important part of evaluating array expressions without the use of
125temporary storage is determining the shape of the result prior to,
126or without, evaluating the elements of the result.
127
128The shapes of array objects, results of elemental intrinsic functions,
129and results of intrinsic operations are obvious.
130But it is possible to determine the shapes of the results of many
131transformational intrinsic function calls as well.
132
133* `SHAPE(SUM(X,DIM=d))` is `SHAPE(X)` with one element removed:
134  `PACK(SHAPE(X),[(j,j=1,RANK(X))]/=d)` in general.
135  (The `DIM=` argument is commonly a compile-time constant.)
136* `SHAPE(MAXLOC(X))` is `[RANK(X)]`.
137* `SHAPE(MAXLOC(X,DIM=d))` is `SHAPE(X)` with one element removed.
138* `SHAPE(TRANSPOSE(M))` is a reversal of `SHAPE(M)`.
139* `SHAPE(RESHAPE(..., SHAPE=S))` is `S`.
140* `SHAPE(CSHIFT(X))` is `SHAPE(X)`; same with `EOSHIFT`.
141* `SHAPE(PACK(A,VECTOR=V))` is `SHAPE(V)`
142* `SHAPE(PACK(A,MASK=m))` with non-scalar `m` and without `VECTOR=` is `[COUNT(m)]`.
143* `RANK(PACK(...))` is always 1.
144* `SHAPE(UNPACK(MASK=M))` is `SHAPE(M)`.
145* `SHAPE(MATMUL(A,B))` drops one value from `SHAPE(A)` and another from `SHAPE(B)`.
146* `SHAPE(SHAPE(X))` is `[RANK(X)]`.
147* `SHAPE(SPREAD(A,DIM=d,NCOPIES=n))` is `SHAPE(A)` with `n` inserted at
148  dimension `d`.
149
150This is useful because expression evaluations that *do* require temporaries
151to hold their results (due to the context in which the evaluation occurs)
152can be implemented with a separation of the allocation
153of the temporary array and the population of the array.
154The code that evaluates the expression, or that implements a transformational
155intrinsic in the runtime library, can be designed with an API that includes
156a pointer to the destination array as an argument.
157
158Statements like `ALLOCATE(A,SOURCE=expression)` should thus be capable
159of evaluating their array expressions directly into the newly-allocated
160storage for the allocatable array.
161The implementation would generate code to calculate the shape, use it
162to allocate the memory and populate the descriptor, and then drive a
163loop nest around the expression to populate the array.
164In cases where the analyzed shape is known at compile time, we should
165be able to have the opportunity to avoid heap allocation in favor of
166stack storage, if the scope of the variable is local.
167
168Automatic reallocation of allocatables
169======================================
170Fortran 2003 introduced the ability to assign non-conforming array expressions
171to ALLOCATABLE arrays with the implied semantics of reallocation to the
172new shape.
173The implementation of this feature also becomes more straightforward if
174our implementation of array expressions has decoupled calculation of shapes
175from the evaluation of the elements of the result.
176
177Rewriting rules
178===============
179Let `{...}` denote an ordered tuple of 1-based indices, e.g. `{j,k}`, into
180the result of an array expression or subexpression.
181
182* Array constructors always yield vectors; higher-rank arrays that appear as
183  constituents are flattened; so `[X] => RESHAPE(X,SHAPE=[SIZE(X)})`.
184* Array constructors with multiple constituents are concatenations of
185  their constituents; so `[X,Y]{j} => MERGE(Y{j-SIZE(X)},X{j},J>SIZE(X))`.
186* Array constructors with implied DO loops are difficult when nested
187  triangularly.
188* Whole array references can have lower bounds other than 1, so
189  `A => A(LBOUND(A,1):UBOUND(A,1),...)`.
190* Array sections simply apply indices: `A(i:...:n){j} => A(i1+n*(j-1))`.
191* Vector-valued subscripts apply indices to the subscript: `A(N(:)){j} => A(N(:){j})`.
192* Scalar operands ignore indices: `X{j,k} => X`.
193  Further, they are evaluated at most once.
194* Elemental operators and functions apply indices to their arguments:
195  `(A(:,:) + B(:,:)){j,k}` => A(:,:){j,k} + B(:,:){j,k}`.
196* `TRANSPOSE(X){j,k} => X{k,j}`.
197* `SPREAD(X,DIM=2,...){j,k} => X{j}`; i.e., the contents are replicated.
198  If X is sufficiently expensive to compute elementally, it might be evaluated
199  into a temporary.
200
201(more...)
202