1.. _parallel_reduce: 2 3parallel_reduce 4=============== 5 6 7A loop can do a reduction, as in this summation: 8 9 10:: 11 12 13 float SerialSumFoo( float a[], size_t n ) { 14 float sum = 0; 15 for( size_t i=0; i!=n; ++i ) 16 sum += Foo(a[i]); 17 return sum; 18 } 19 20 21If the iterations are independent, you can parallelize this loop using 22the template class ``parallel_reduce`` as follows: 23 24 25:: 26 27 28 float ParallelSumFoo( const float a[], size_t n ) { 29 SumFoo sf(a); 30 parallel_reduce( blocked_range<size_t>(0,n), sf ); 31 return sf.my_sum; 32 } 33 34 35The class ``SumFoo`` specifies details of the reduction, such as how to 36accumulate subsums and combine them. Here is the definition of class 37``SumFoo``: 38 39 40:: 41 42 43 class SumFoo { 44 float* my_a; 45 public: 46 float my_sum; 47 void operator()( const blocked_range<size_t>& r ) { 48 float *a = my_a; 49 float sum = my_sum; 50 size_t end = r.end(); 51 for( size_t i=r.begin(); i!=end; ++i ) 52 sum += Foo(a[i]); 53 my_sum = sum; 54 } 55 56 57 SumFoo( SumFoo& x, split ) : my_a(x.my_a), my_sum(0) {} 58 59 60 void join( const SumFoo& y ) {my_sum+=y.my_sum;} 61 62 63 SumFoo(float a[] ) : 64 my_a(a), my_sum(0) 65 {} 66 }; 67 68 69Note the differences with class ``ApplyFoo`` from parallel_for. First, 70``operator()`` is *not* ``const``. This is because it must update 71SumFoo::my_sum. Second, ``SumFoo`` has a *splitting constructor* and a 72method ``join`` that must be present for ``parallel_reduce`` to work. 73The splitting constructor takes as arguments a reference to the original 74object, and a dummy argument of type ``split``, which is defined by the 75library. The dummy argument distinguishes the splitting constructor from 76a copy constructor. 77 78 79.. tip:: 80 In the example, the definition of ``operator()`` uses local temporary 81 variables (``a``, ``sum``, ``end``) for scalar values accessed inside 82 the loop. This technique can improve performance by making it obvious 83 to the compiler that the values can be held in registers instead of 84 memory. If the values are too large to fit in registers, or have 85 their address taken in a way the compiler cannot track, the technique 86 might not help. With a typical optimizing compiler, using local 87 temporaries for only written variables (such as ``sum`` in the 88 example) can suffice, because then the compiler can deduce that the 89 loop does not write to any of the other locations, and hoist the 90 other reads to outside the loop. 91 92 93When a worker thread is available, as decided by the task scheduler, 94``parallel_reduce`` invokes the splitting constructor to create a 95subtask for the worker. When the subtask completes, ``parallel_reduce`` 96uses method ``join`` to accumulate the result of the subtask. The graph 97at the top of the following figure shows the split-join sequence that 98happens when a worker is available: 99 100 101.. container:: fignone 102 :name: fig5 103 104 105 Graph of the Split-join Sequence 106 |image0| 107 108 109An arrows in the above figure indicate order in time. The splitting 110constructor might run concurrently while object ``x`` is being used for the 111first half of the reduction. Therefore, all actions of the splitting 112constructor that creates y must be made thread safe with respect to ``x``. 113So if the splitting constructor needs to increment a reference count 114shared with other objects, it should use an atomic increment. 115 116 117If a worker is not available, the second half of the iteration is 118reduced using the same body object that reduced the first half. That is 119the reduction of the second half starts where reduction of the first 120half finished. 121 122 123.. CAUTION:: 124 Since split/join are not used if workers are unavailable, 125 ``parallel_reduce`` does not necessarily do recursive splitting. 126 127 128.. CAUTION:: 129 Since the same body might be used to accumulate multiple subranges, 130 it is critical that ``operator()`` not discard earlier accumulations. 131 The code below shows an incorrect definition of 132 ``SumFoo::operator()``. 133 134 135:: 136 137 138 class SumFoo { 139 ... 140 public: 141 float my_sum; 142 void operator()( const blocked_range<size_t>& r ) { 143 ... 144 float sum = 0; // WRONG – should be 'sum = my_sum". 145 ... 146 for( ... ) 147 sum += Foo(a[i]); 148 my_sum = sum; 149 } 150 ... 151 }; 152 153 154With the mistake, the body returns a partial sum for the last subrange 155instead of all subranges to which ``parallel_reduce`` applies it. 156 157 158The rules for partitioners and grain sizes for ``parallel_reduce`` are 159the same as for ``parallel_for``. 160 161 162``parallel_reduce`` generalizes to any associative operation. In 163general, the splitting constructor does two things: 164 165 166- Copy read-only information necessary to run the loop body. 167 168 169- Initialize the reduction variable(s) to the identity element of the 170 operation(s). 171 172 173The join method should do the corresponding merge(s). You can do more 174than one reduction at the same time: you can gather the min and max with 175a single ``parallel_reduce``. 176 177 178.. note:: 179 The reduction operation can be non-commutative. The example still 180 works if floating-point addition is replaced by string concatenation. 181 182 183.. |image0| image:: Images/image009.jpg 184 :width: 512px 185 :height: 438px 186 187