1.. _parallel_reduce:
2
3parallel_reduce
4===============
5
6
7A loop can do a reduction, as in this summation:
8
9
10::
11
12
13   float SerialSumFoo( float a[], size_t n ) {
14       float sum = 0;
15       for( size_t i=0; i!=n; ++i )
16           sum += Foo(a[i]);
17       return sum;
18   }
19
20
21If the iterations are independent, you can parallelize this loop using
22the template class ``parallel_reduce`` as follows:
23
24
25::
26
27
28   float ParallelSumFoo( const float a[], size_t n ) {
29       SumFoo sf(a);
30       parallel_reduce( blocked_range<size_t>(0,n), sf );
31       return sf.my_sum;
32   }
33
34
35The class ``SumFoo`` specifies details of the reduction, such as how to
36accumulate subsums and combine them. Here is the definition of class
37``SumFoo``:
38
39
40::
41
42
43   class SumFoo {
44       float* my_a;
45   public:
46       float my_sum;
47       void operator()( const blocked_range<size_t>& r ) {
48           float *a = my_a;
49           float sum = my_sum;
50           size_t end = r.end();
51           for( size_t i=r.begin(); i!=end; ++i )
52               sum += Foo(a[i]);
53           my_sum = sum;
54       }
55    
56
57       SumFoo( SumFoo& x, split ) : my_a(x.my_a), my_sum(0) {}
58    
59
60       void join( const SumFoo& y ) {my_sum+=y.my_sum;}
61
62
63       SumFoo(float a[] ) :
64           my_a(a), my_sum(0)
65       {}
66   };
67
68
69Note the differences with class ``ApplyFoo`` from parallel_for. First,
70``operator()`` is *not* ``const``. This is because it must update
71SumFoo::my_sum. Second, ``SumFoo`` has a *splitting constructor* and a
72method ``join`` that must be present for ``parallel_reduce`` to work.
73The splitting constructor takes as arguments a reference to the original
74object, and a dummy argument of type ``split``, which is defined by the
75library. The dummy argument distinguishes the splitting constructor from
76a copy constructor.
77
78
79.. tip::
80   In the example, the definition of ``operator()`` uses local temporary
81   variables (``a``, ``sum``, ``end``) for scalar values accessed inside
82   the loop. This technique can improve performance by making it obvious
83   to the compiler that the values can be held in registers instead of
84   memory. If the values are too large to fit in registers, or have
85   their address taken in a way the compiler cannot track, the technique
86   might not help. With a typical optimizing compiler, using local
87   temporaries for only written variables (such as ``sum`` in the
88   example) can suffice, because then the compiler can deduce that the
89   loop does not write to any of the other locations, and hoist the
90   other reads to outside the loop.
91
92
93When a worker thread is available, as decided by the task scheduler,
94``parallel_reduce`` invokes the splitting constructor to create a
95subtask for the worker. When the subtask completes, ``parallel_reduce``
96uses method ``join`` to accumulate the result of the subtask. The graph
97at the top of the following figure shows the split-join sequence that
98happens when a worker is available:
99
100
101.. container:: fignone
102   :name: fig5
103
104
105   Graph of the Split-join Sequence
106   |image0|
107
108
109An arrows in the above figure indicate order in time. The splitting
110constructor might run concurrently while object ``x`` is being used for the
111first half of the reduction. Therefore, all actions of the splitting
112constructor that creates y must be made thread safe with respect to ``x``.
113So if the splitting constructor needs to increment a reference count
114shared with other objects, it should use an atomic increment.
115
116
117If a worker is not available, the second half of the iteration is
118reduced using the same body object that reduced the first half. That is
119the reduction of the second half starts where reduction of the first
120half finished.
121
122
123.. CAUTION::
124   Since split/join are not used if workers are unavailable,
125   ``parallel_reduce`` does not necessarily do recursive splitting.
126
127
128.. CAUTION::
129   Since the same body might be used to accumulate multiple subranges,
130   it is critical that ``operator()`` not discard earlier accumulations.
131   The code below shows an incorrect definition of
132   ``SumFoo::operator()``.
133
134
135::
136
137
138   class SumFoo {
139       ...
140   public:
141       float my_sum;
142       void operator()( const blocked_range<size_t>& r ) {
143           ...
144           float sum = 0;  // WRONG – should be 'sum = my_sum".
145           ...
146           for( ... )
147               sum += Foo(a[i]);
148           my_sum = sum;
149       }
150       ...
151   };
152
153
154With the mistake, the body returns a partial sum for the last subrange
155instead of all subranges to which ``parallel_reduce`` applies it.
156
157
158The rules for partitioners and grain sizes for ``parallel_reduce`` are
159the same as for ``parallel_for``.
160
161
162``parallel_reduce`` generalizes to any associative operation. In
163general, the splitting constructor does two things:
164
165
166-  Copy read-only information necessary to run the loop body.
167
168
169-  Initialize the reduction variable(s) to the identity element of the
170   operation(s).
171
172
173The join method should do the corresponding merge(s). You can do more
174than one reduction at the same time: you can gather the min and max with
175a single ``parallel_reduce``.
176
177
178.. note::
179   The reduction operation can be non-commutative. The example still
180   works if floating-point addition is replaced by string concatenation.
181
182
183.. |image0| image:: Images/image009.jpg
184   :width: 512px
185   :height: 438px
186
187