.. _Reduction:

Reduction
=========


.. container:: section


   .. rubric:: Problem
      :class: sectiontitle

   Perform an associative reduction operation across a data set.


.. container:: section


   .. rubric:: Context
      :class: sectiontitle

   Many serial algorithms sweep over a set of items to collect summary
   information.


.. container:: section


   .. rubric:: Forces
      :class: sectiontitle

   The summary can be expressed as an associative operation over the
   data set, or at least is close enough to associative that
   reassociation does not matter.


.. container:: section


   .. rubric:: Solution
      :class: sectiontitle

   Two solutions exist in |full_name|.
   The choice of which to use depends on several considerations:


   - Is the operation commutative as well as associative?


   - Are instances of the reduction type expensive to construct and
     destroy? For example, a floating-point number is inexpensive to
     construct, while a sparse floating-point matrix might be very
     expensive to construct.


   Use ``oneapi::tbb::parallel_reduce`` when the objects are inexpensive to
   construct. It works even if the reduction operation is not
   commutative.


   Use ``oneapi::tbb::parallel_for`` and ``oneapi::tbb::combinable`` if the reduction
   operation is commutative and instances of the type are expensive.


   If the operation is not precisely associative but a precisely
   deterministic result is required, use recursive reduction and
   parallelize it with ``oneapi::tbb::parallel_invoke``.


.. container:: section


   .. rubric:: Examples
      :class: sectiontitle

   The examples presented here illustrate the various solutions and some
   tradeoffs.


   The first example uses ``oneapi::tbb::parallel_reduce`` to do a + reduction
   over a sequence of type ``T``. The sequence is defined by the half-open
   interval [first,last).


   ::


      T AssociativeReduce( const T* first, const T* last, T identity ) {
          return oneapi::tbb::parallel_reduce(
              // Index range for reduction
              oneapi::tbb::blocked_range<const T*>(first,last),
              // Identity element
              identity,
              // Reduce a subrange and partial sum
              [&]( oneapi::tbb::blocked_range<const T*> r, T partial_sum )->T {
                  return std::accumulate( r.begin(), r.end(), partial_sum );
              },
              // Reduce two partial sums
              std::plus<T>()
          );
      }


   The third and fourth arguments to this form of ``parallel_reduce``
   are a built-in form of the agglomeration pattern. If there is an
   elementwise action to be performed before the reduction,
   incorporating it into the third argument (reduction of a subrange)
   may improve performance because of better locality of reference, as
   the sketch below shows. Note that the block size for agglomeration is
   not explicitly specified; ``parallel_reduce`` chooses blocks
   automatically with the help of the implicitly used
   ``oneapi::tbb::auto_partitioner``.
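   For instance, a sum-of-squares reduction can fuse the elementwise
   squaring into the subrange lambda instead of first materializing a
   squared copy of the input. The following sketch illustrates the idea;
   the ``SumOfSquares`` helper is hypothetical, not part of the library:


   ::


      #include <functional>
      #include <oneapi/tbb/blocked_range.h>
      #include <oneapi/tbb/parallel_reduce.h>

      // Hypothetical helper: sum of x*x over [first,last). The elementwise
      // squaring is fused into the subrange reduction, so each element is
      // read only once, while it is still hot in cache.
      float SumOfSquares( const float* first, const float* last ) {
          return oneapi::tbb::parallel_reduce(
              oneapi::tbb::blocked_range<const float*>(first,last),
              0.0f,
              []( oneapi::tbb::blocked_range<const float*> r, float partial_sum )->float {
                  for( const float* p=r.begin(); p!=r.end(); ++p )
                      partial_sum += (*p)*(*p);
                  return partial_sum;
              },
              std::plus<float>()
          );
      }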
   The second example assumes that + is commutative on ``T``. It is a
   good solution when ``T`` objects are expensive to construct.


   ::


      T CombineReduce( const T* first, const T* last, T identity ) {
          // Each thread lazily constructs its own accumulator,
          // initialized to the identity element.
          oneapi::tbb::combinable<T> sum( [=]{ return identity; } );
          oneapi::tbb::parallel_for(
              oneapi::tbb::blocked_range<const T*>(first,last),
              [&]( oneapi::tbb::blocked_range<const T*> r ) {
                  sum.local() += std::accumulate( r.begin(), r.end(), identity );
              }
          );
          return sum.combine( []( const T& x, const T& y ) {return x+y;} );
      }


   Sometimes it is desirable to use the partial results destructively to
   generate the final result. For example, if the partial results are
   lists, they can be spliced together to form the final result. In that
   case use class ``oneapi::tbb::enumerable_thread_specific`` instead of
   ``combinable``. The ``ParallelFindCollisions`` example in :ref:`Divide_and_Conquer`
   demonstrates the technique, and the sketch below shows the idea in
   miniature.
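   The following hypothetical ``ParallelCollect`` helper (not part of the
   library) gathers matching items into per-thread lists and then splices
   the partial lists together. ``splice`` runs in constant time and
   empties its source list, which is exactly the destructive use that
   favors ``enumerable_thread_specific`` over ``combinable``:


   ::


      #include <list>
      #include <oneapi/tbb/blocked_range.h>
      #include <oneapi/tbb/enumerable_thread_specific.h>
      #include <oneapi/tbb/parallel_for.h>

      // Hypothetical sketch: collect the items of [first,last) that satisfy
      // pred, using one std::list per thread as the partial result.
      template<typename T, typename Pred>
      std::list<T> ParallelCollect( const T* first, const T* last, Pred pred ) {
          oneapi::tbb::enumerable_thread_specific<std::list<T>> partial;
          oneapi::tbb::parallel_for(
              oneapi::tbb::blocked_range<const T*>(first,last),
              [&]( oneapi::tbb::blocked_range<const T*> r ) {
                  std::list<T>& mine = partial.local();
                  for( const T* p=r.begin(); p!=r.end(); ++p )
                      if( pred(*p) )
                          mine.push_back(*p);
              }
          );
          std::list<T> result;
          // Splicing is O(1) per list and consumes each partial result.
          for( std::list<T>& l : partial )
              result.splice( result.end(), l );
          return result;
      }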
   Floating-point addition and multiplication are almost, but not quite,
   associative: reassociation can change results because of rounding
   effects. The techniques shown so far reassociate terms
   non-deterministically. A fully deterministic parallel reduction over a
   not-quite-associative operation requires deterministic reassociation.
   The code below demonstrates this in the form of a template that does a
   + reduction over a sequence of values of type ``T``.


   ::


      template<typename T>
      T RepeatableReduce( const T* first, const T* last, T identity ) {
          if( last-first<=1000 ) {
              // Use serial reduction
              return std::accumulate( first, last, identity );
          } else {
              // Do parallel divide-and-conquer reduction
              const T* mid = first+(last-first)/2;
              T left, right;
              oneapi::tbb::parallel_invoke(
                  [&]{left=RepeatableReduce(first,mid,identity);},
                  [&]{right=RepeatableReduce(mid,last,identity);}
              );
              return left+right;
          }
      }


   The outer if-else is an instance of the agglomeration pattern for
   recursive computations. The reduction graph, though not a strict
   binary tree, is fully deterministic. Thus the result will always be
   the same for a given input sequence, assuming all threads do
   identical floating-point rounding.


   ``oneapi::tbb::parallel_deterministic_reduce`` is a simpler and more
   efficient way to get a reproducible non-associative reduction. It is
   very similar to ``oneapi::tbb::parallel_reduce`` but, unlike the latter,
   builds a deterministic reduction graph. With it, the
   ``RepeatableReduce`` sample can be almost identical to
   ``AssociativeReduce``:


   ::


      template<typename T>
      T RepeatableReduce( const T* first, const T* last, T identity ) {
          return oneapi::tbb::parallel_deterministic_reduce(
              // Index range for reduction
              oneapi::tbb::blocked_range<const T*>(first,last,1000),
              // Identity element
              identity,
              // Reduce a subrange and partial sum
              [&]( oneapi::tbb::blocked_range<const T*> r, T partial_sum )->T {
                  return std::accumulate( r.begin(), r.end(), partial_sum );
              },
              // Reduce two partial sums
              std::plus<T>()
          );
      }


   Besides the function name change, note the grain size of 1000
   specified for ``oneapi::tbb::blocked_range``. It defines the desired block
   size for agglomeration; automatic block size selection is not used
   because it would make the reduction graph, and hence the result,
   non-deterministic.


   The final example shows how a problem that is typically not viewed as
   a reduction can be parallelized by viewing it as a reduction. The
   problem is retrieving floating-point exception flags for a
   computation across a data set. The serial code might look something
   like:


   ::


      feclearexcept(FE_ALL_EXCEPT);
      for( int i=0; i<N; ++i )
          C[i] = A[i]/B[i];
      int flags = fetestexcept(FE_ALL_EXCEPT);
      if (flags & FE_DIVBYZERO) ...;
      if (flags & FE_OVERFLOW) ...;
      ...


   The code can be parallelized by computing chunks of the loop
   separately and merging the floating-point flags from each chunk. To do
   this with ``oneapi::tbb::parallel_reduce``, first define a "body" type,
   as shown below.


   ::


      struct ComputeChunk {
          int flags; // Holds floating-point exceptions seen so far.
          void reset_fpe() {
              flags = 0;
              feclearexcept(FE_ALL_EXCEPT);
          }
          ComputeChunk() {
              reset_fpe();
          }
          // "Splitting constructor" called by parallel_reduce when splitting a range into subranges.
          ComputeChunk( const ComputeChunk&, oneapi::tbb::split ) {
              reset_fpe();
          }
          // Operates on a chunk and collects floating-point exception state into the flags member.
          void operator()( oneapi::tbb::blocked_range<int> r ) {
              int end = r.end();
              for( int i=r.begin(); i!=end; ++i )
                  C[i] = A[i]/B[i];
              // It is critical to do |= here, not =, because otherwise we
              // might lose earlier exceptions from the same thread.
              flags |= fetestexcept(FE_ALL_EXCEPT);
          }
          // Called by parallel_reduce when joining results from two subranges.
          void join( ComputeChunk& other ) {
              flags |= other.flags;
          }
      };


   Then invoke it as follows:


   ::


      // Construction of cc implicitly resets the FP exception state.
      ComputeChunk cc;
      oneapi::tbb::parallel_reduce( oneapi::tbb::blocked_range<int>(0,N), cc );
      if (cc.flags & FE_DIVBYZERO) ...;
      if (cc.flags & FE_OVERFLOW) ...;
      ...
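   For completeness, a minimal driver might look as follows. The sample
   arrays ``A``, ``B``, ``C`` and size ``N`` are hypothetical stand-ins
   for whatever data the real computation uses, and they must be declared
   before ``ComputeChunk`` so that its ``operator()`` can refer to them:


   ::


      #include <cfenv>
      #include <cstdio>
      #include <oneapi/tbb/blocked_range.h>
      #include <oneapi/tbb/parallel_reduce.h>

      // Hypothetical sample data; B[1]==0 forces a division by zero.
      const int N = 4;
      float A[N] = {1, 2, 3, 4};
      float B[N] = {1, 0, 2, 4};
      float C[N];

      // ... definition of ComputeChunk from above goes here ...

      int main() {
          ComputeChunk cc;
          oneapi::tbb::parallel_reduce( oneapi::tbb::blocked_range<int>(0,N), cc );
          if( cc.flags & FE_DIVBYZERO ) std::printf( "division by zero occurred\n" );
          if( cc.flags & FE_OVERFLOW )  std::printf( "overflow occurred\n" );
          return 0;
      }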