.. _Reduction:

Reduction
=========


.. container:: section


   .. rubric:: Problem
      :class: sectiontitle

   Perform an associative reduction operation across a data set.


.. container:: section


   .. rubric:: Context
      :class: sectiontitle

   Many serial algorithms sweep over a set of items to collect summary
   information.


.. container:: section


   .. rubric:: Forces
      :class: sectiontitle

   The summary can be expressed as an associative operation over the
   data set, or at least is close enough to associative that
   reassociation does not matter.


.. container:: section


   .. rubric:: Solution
      :class: sectiontitle

   Two solutions exist in |full_name|.
   The choice of which to use depends upon several considerations:


   -  Is the operation commutative as well as associative?


   -  Are instances of the reduction type expensive to construct and
      destroy? For example, a floating-point number is inexpensive to
      construct, whereas a sparse floating-point matrix might be very
      expensive to construct.


   Use ``oneapi::tbb::parallel_reduce`` when the objects are inexpensive to
   construct. It works even if the reduction operation is not
   commutative.


   Use ``oneapi::tbb::parallel_for`` and ``oneapi::tbb::combinable`` if the reduction
   operation is commutative and instances of the type are expensive.


   If the operation is not precisely associative but a precisely
   deterministic result is required, use recursive reduction and
   parallelize it using ``oneapi::tbb::parallel_invoke``.


.. container:: section


   .. rubric:: Examples
      :class: sectiontitle

   The examples presented here illustrate the various solutions and some
   tradeoffs.

   The first example uses ``oneapi::tbb::parallel_reduce`` to do a + reduction
   over a sequence of type ``T``. The sequence is defined by the half-open
   interval [first,last).

   ::


      T AssociativeReduce( const T* first, const T* last, T identity ) {
         return oneapi::tbb::parallel_reduce(
             // Index range for reduction
             oneapi::tbb::blocked_range<const T*>(first,last),
             // Identity element
             identity,
             // Reduce a subrange and partial sum
             [&]( oneapi::tbb::blocked_range<const T*> r, T partial_sum )->T {
                 return std::accumulate( r.begin(), r.end(), partial_sum );
             },
             // Reduce two partial sums
             std::plus<T>()
         );
      }


   The third and fourth arguments to this form of ``parallel_reduce``
   are a built-in form of the agglomeration pattern. If there is an
   elementwise action to be performed before the reduction,
   incorporating it into the third argument (reduction of a subrange)
   may improve performance because of better locality of reference. Note
   that the block size for agglomeration is not explicitly specified;
   ``parallel_reduce`` defines blocks automatically with the help of the
   implicitly used ``oneapi::tbb::auto_partitioner``.
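
   For instance, to compute a sum of squares, the elementwise squaring can
   be folded into the subrange reduction instead of squaring the data in a
   separate pass. The sketch below illustrates the idea for ``float`` data;
   the function name ``SumOfSquares`` is illustrative only.

   ::


      #include <functional>
      #include <oneapi/tbb/blocked_range.h>
      #include <oneapi/tbb/parallel_reduce.h>

      float SumOfSquares( const float* first, const float* last ) {
         return oneapi::tbb::parallel_reduce(
             oneapi::tbb::blocked_range<const float*>(first,last),
             // Identity element
             0.0f,
             // The elementwise squaring happens here, while each value is in cache.
             []( oneapi::tbb::blocked_range<const float*> r, float partial_sum )->float {
                 for( const float* p=r.begin(); p!=r.end(); ++p )
                     partial_sum += (*p)*(*p);
                 return partial_sum;
             },
             // Reduce two partial sums
             std::plus<float>()
         );
      }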


   The second example assumes that + is commutative on ``T``. It is a
   good solution when ``T`` objects are expensive to construct.


   ::


      T CombineReduce( const T* first, const T* last, T identity ) {
         // Pass a function object so that each thread-local value is
         // initialized to a copy of identity.
         oneapi::tbb::combinable<T> sum( [=]{ return identity; } );
         oneapi::tbb::parallel_for(
             oneapi::tbb::blocked_range<const T*>(first,last),
             [&]( oneapi::tbb::blocked_range<const T*> r ) {
                 sum.local() += std::accumulate(r.begin(), r.end(), identity);
             }
         );
         return sum.combine( []( const T& x, const T& y ) {return x+y;} );
      }


   Sometimes it is desirable to destructively use the partial results to
   generate the final result. For example, if the partial results are
   lists, they can be spliced together to form the final result. In that
   case use class ``oneapi::tbb::enumerable_thread_specific`` instead of
   ``combinable``. The ``ParallelFindCollisions`` example in :ref:`Divide_and_Conquer`
   demonstrates the technique.
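
   As an illustration of that technique, the sketch below (not the code from
   the referenced example) collects elements satisfying a caller-supplied
   predicate into per-thread lists and then splices the lists together;
   ``ParallelCollect`` and ``pred`` are illustrative names.

   ::


      #include <list>
      #include <oneapi/tbb/blocked_range.h>
      #include <oneapi/tbb/enumerable_thread_specific.h>
      #include <oneapi/tbb/parallel_for.h>

      template<typename T, typename Pred>
      std::list<T> ParallelCollect( const T* first, const T* last, Pred pred ) {
         // One std::list<T> per thread, created lazily on first use.
         oneapi::tbb::enumerable_thread_specific<std::list<T>> partial;
         oneapi::tbb::parallel_for(
             oneapi::tbb::blocked_range<const T*>(first,last),
             [&]( oneapi::tbb::blocked_range<const T*> r ) {
                 std::list<T>& my_list = partial.local();
                 for( const T* p=r.begin(); p!=r.end(); ++p )
                     if( pred(*p) )
                         my_list.push_back(*p);
             }
         );
         // Destructively merge the per-thread lists: splice moves list nodes
         // in O(1) rather than copying elements, which is why combinable is
         // a poor fit here.
         std::list<T> result;
         for( std::list<T>& l : partial )
             result.splice( result.end(), l );
         return result;
      }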


   Floating-point addition and multiplication are almost associative.
   Reassociation can cause changes because of rounding effects. The
   techniques shown so far reassociate terms non-deterministically.
   Fully deterministic parallel reduction for a not-quite-associative
   operation requires using deterministic reassociation. The code below
   demonstrates this in the form of a template that does a + reduction
   over a sequence of values of type ``T``.

   ::


      template<typename T>
      T RepeatableReduce( const T* first, const T* last, T identity ) {
         if( last-first<=1000 ) {
             // Use serial reduction
             return std::accumulate( first, last, identity );
         } else {
             // Do parallel divide-and-conquer reduction
             const T* mid = first+(last-first)/2;
             T left, right;
             oneapi::tbb::parallel_invoke(
                 [&]{left=RepeatableReduce(first,mid,identity);},
                 [&]{right=RepeatableReduce(mid,last,identity);}
             );
             return left+right;
         }
      }


   The outer if-else is an instance of the agglomeration pattern for
   recursive computations. The reduction graph, though not a strict
   binary tree, is fully deterministic. Thus the result will always be
   the same for a given input sequence, assuming all threads do
   identical floating-point rounding.


   ``oneapi::tbb::parallel_deterministic_reduce`` is a simpler and more
   efficient way to get reproducible non-associative reduction. It is
   very similar to ``oneapi::tbb::parallel_reduce`` but, unlike the latter,
   builds a deterministic reduction graph. With it, the
   ``RepeatableReduce`` sample can be almost identical to
   ``AssociativeReduce``:

   ::


      template<typename T>
      T RepeatableReduce( const T* first, const T* last, T identity ) {
         return oneapi::tbb::parallel_deterministic_reduce(
             // Index range for reduction
             oneapi::tbb::blocked_range<const T*>(first,last,1000),
             // Identity element
             identity,
             // Reduce a subrange and partial sum
             [&]( oneapi::tbb::blocked_range<const T*> r, T partial_sum )->T {
                 return std::accumulate( r.begin(), r.end(), partial_sum );
             },
             // Reduce two partial sums
             std::plus<T>()
         );
      }


   Besides the function name change, note the grain size of 1000
   specified for ``oneapi::tbb::blocked_range``. It defines the desired block
   size for agglomeration; automatic block size selection is not used
   because it would make the reduction graph non-deterministic.
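
   With the grain size fixed, the reduction graph depends only on the input
   size, so repeated runs over the same data give bit-identical results
   regardless of the number of threads. A minimal usage sketch (the data
   values are illustrative):

   ::


      #include <cassert>
      #include <vector>

      int main() {
         std::vector<float> v( 1000000, 0.1f );
         // Both calls reassociate the terms identically ...
         float r1 = RepeatableReduce( v.data(), v.data()+v.size(), 0.0f );
         float r2 = RepeatableReduce( v.data(), v.data()+v.size(), 0.0f );
         // ... so the rounded results match bit for bit.
         assert( r1 == r2 );
         return 0;
      }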


   The final example shows how a problem that typically is not viewed as
   a reduction can be parallelized by viewing it as a reduction. The
   problem is retrieving floating-point exception flags for a
   computation across a data set. The serial code might look something
   like:

   ::


         feclearexcept(FE_ALL_EXCEPT);
         for( int i=0; i<N; ++i )
             C[i] = A[i]/B[i];
         int flags = fetestexcept(FE_ALL_EXCEPT);
         if (flags & FE_DIVBYZERO) ...;
         if (flags & FE_OVERFLOW) ...;
         ...


   The code can be parallelized by computing chunks of the loop
   separately, and merging floating-point flags from each chunk. To do
   this with ``oneapi::tbb::parallel_reduce``, first define a "body" type, as
   shown below.


   ::


      struct ComputeChunk {
         int flags;          // Holds floating-point exceptions seen so far.
         void reset_fpe() {
             flags=0;
             feclearexcept(FE_ALL_EXCEPT);
         }
         ComputeChunk () {
             reset_fpe();
         }
         // "Splitting constructor" called by parallel_reduce when splitting a range into subranges.
         ComputeChunk ( const ComputeChunk&, oneapi::tbb::split ) {
             reset_fpe();
         }
         // Operates on a chunk and collects floating-point exception state into flags member.
         void operator()( oneapi::tbb::blocked_range<int> r ) {
             int end=r.end();
             for( int i=r.begin(); i!=end; ++i )
                 C[i] = A[i]/B[i];
             // It is critical to do |= here, not =, because otherwise we
             // might lose earlier exceptions from the same thread.
             flags |= fetestexcept(FE_ALL_EXCEPT);
         }
         // Called by parallel_reduce when joining results from two subranges.
         void join( ComputeChunk& other ) {
             flags |= other.flags;
         }
      };


   Then invoke it as follows:


   ::


         // Construction of cc implicitly resets FP exception state.
         ComputeChunk cc;
         oneapi::tbb::parallel_reduce( oneapi::tbb::blocked_range<int>(0,N), cc );
         if (cc.flags & FE_DIVBYZERO) ...;
         if (cc.flags & FE_OVERFLOW) ...;
         ...


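   The same reduction can also be expressed in the functional (lambda) form of
   ``oneapi::tbb::parallel_reduce``, carrying the flags as the reduction value
   instead of in a body object. The sketch below assumes the same ``A``, ``B``,
   ``C``, and ``N`` as above; ``ComputeAndCollectFlags`` is an illustrative name.

   ::


      #include <fenv.h>
      #include <functional>
      #include <oneapi/tbb/blocked_range.h>
      #include <oneapi/tbb/parallel_reduce.h>

      int ComputeAndCollectFlags() {
         int flags = oneapi::tbb::parallel_reduce(
             oneapi::tbb::blocked_range<int>(0,N),
             // Identity element: no exceptions seen yet.
             0,
             // Process a chunk and fold its exception flags into the partial result.
             []( oneapi::tbb::blocked_range<int> r, int f )->int {
                 feclearexcept(FE_ALL_EXCEPT);   // clear hardware flags for this chunk
                 for( int i=r.begin(); i!=r.end(); ++i )
                     C[i] = A[i]/B[i];
                 return f | fetestexcept(FE_ALL_EXCEPT);
             },
             // Merge flags from two partial results.
             std::bit_or<int>()
         );
         // flags can now be tested against FE_DIVBYZERO, FE_OVERFLOW, etc.,
         // exactly as in the serial version.
         return flags;
      }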