.. _Divide_and_Conquer:

Divide and Conquer
==================


.. container:: section


   .. rubric:: Problem
      :class: sectiontitle

   Parallelize a divide and conquer algorithm.


.. container:: section


   .. rubric:: Context
      :class: sectiontitle

   Divide and conquer is widely used in serial algorithms. Common
   examples are quicksort and mergesort.


.. container:: section


   .. rubric:: Forces
      :class: sectiontitle

   -  The problem can be transformed into subproblems that can be
      solved independently.


   -  Splitting the problem or merging the solutions is relatively
      cheap compared to the cost of solving the subproblems.


.. container:: section


   .. rubric:: Solution
      :class: sectiontitle

   There are several ways to implement divide and conquer in
   |full_name|. The best choice depends upon circumstances.


   -  If division always yields the same number of subproblems, use
      recursion and ``oneapi::tbb::parallel_invoke``. A minimal sketch
      of this case appears after the list.


   -  If the number of subproblems varies, use recursion and
      ``oneapi::tbb::task_group``.
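
   The following sketch of the first case is not from the original
   example set; the function ``ParallelSum`` and its serial cutoff of
   1000 elements are illustrative assumptions. Division always yields
   exactly two halves, so ``oneapi::tbb::parallel_invoke`` fits:


   ::


      #include <oneapi/tbb/parallel_invoke.h>
      #include <numeric>

      long ParallelSum( const long* first, const long* last ) {
         if( last-first<=1000 )                         // small subproblem: solve serially
             return std::accumulate( first, last, 0L );
         const long* mid = first + (last-first)/2;      // split into two halves
         long left, right;
         oneapi::tbb::parallel_invoke( [&]{left = ParallelSum( first, mid );},
                               [&]{right = ParallelSum( mid, last );} );
         return left+right;                             // merge the partial sums
      }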


.. container:: section


   .. rubric:: Example
      :class: sectiontitle

   Quicksort is a classic divide-and-conquer algorithm. It divides a
   sorting problem into two subsorts. A simple serial version looks
   like this [1]_:

   ::


      #include <algorithm>

      // T is the element type being sorted.
      void SerialQuicksort( T* begin, T* end ) {
         if( end-begin>1 ) {
             using namespace std;
             // Move elements less than the pivot *begin to the front,
             // then swap the pivot into its final position.
             T* mid = partition( begin+1, end, [=](const T& x){return x<*begin;} );
             swap( *begin, mid[-1] );
             SerialQuicksort( begin, mid-1 );
             SerialQuicksort( mid, end );
         }
      }

   The number of subsorts is fixed at two, so ``oneapi::tbb::parallel_invoke``
   provides a simple way to parallelize it. The parallel code is shown
   below:


   ::


      #include <oneapi/tbb/parallel_invoke.h>
      #include <algorithm>

      void ParallelQuicksort( T* begin, T* end ) {
         if( end-begin>1 ) {
             using namespace std;
             T* mid = partition( begin+1, end, [=](const T& x){return x<*begin;} );
             swap( *begin, mid[-1] );
             // Run the two subsorts in parallel.
             oneapi::tbb::parallel_invoke( [=]{ParallelQuicksort( begin, mid-1 );},
                                   [=]{ParallelQuicksort( mid, end );} );
         }
      }

   Eventually the subsorts become small enough that serial execution is
   more efficient. The following variation uses the earlier serial code
   for sorts of fewer than 500 elements:


   ::


      void ParallelQuicksort( T* begin, T* end ) {
         if( end-begin>=500 ) {
             using namespace std;
             T* mid = partition( begin+1, end, [=](const T& x){return x<*begin;} );
             swap( *begin, mid[-1] );
             oneapi::tbb::parallel_invoke( [=]{ParallelQuicksort( begin, mid-1 );},
                                   [=]{ParallelQuicksort( mid, end );} );
         } else {
             SerialQuicksort( begin, end );   // below the threshold, sort serially
         }
      }

   The change is an instance of the Agglomeration pattern.


   The next example considers a problem with a variable number of
   subproblems. The problem involves a tree-like description of a
   mechanical assembly. There are two kinds of nodes, and the ``Node``
   interface that the code assumes is sketched after the list:


   -  Leaf nodes represent individual parts.


   -  Internal nodes represent groups of parts.
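
   The original snippets leave ``Node`` undefined. One possible shape,
   assumed here only so the examples read as self-contained, is:


   ::


      class Node {
      public:
         bool is_leaf() const;                     // true for an individual part
         bool collides_with( const Node& ) const;  // geometric overlap test
         // Children stored contiguously; the range is empty for leaf nodes.
         typedef Node* const_iterator;
         const_iterator begin() const;
         const_iterator end() const;
      };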


   The problem is to find all nodes that collide with a target node. The
   following code shows a serial solution that walks the tree. It
   records in ``Hits`` any nodes that collide with ``Target``.


   ::


      #include <list>

      std::list<Node*> Hits;
      Node* Target;

      void SerialFindCollisions( Node& x ) {
         if( x.is_leaf() ) {
             if( x.collides_with( *Target ) )
                 Hits.push_back(&x);
         } else {
             for( Node::const_iterator y=x.begin(); y!=x.end(); ++y )
                 SerialFindCollisions(*y);
         }
      }


   A parallel version is shown below.


   ::


      #include <oneapi/tbb/task_group.h>
      #include <oneapi/tbb/enumerable_thread_specific.h>
      #include <list>

      typedef oneapi::tbb::enumerable_thread_specific<std::list<Node*> > LocalList;
      LocalList LocalHits;
      std::list<Node*> Hits;   // final result, as in the serial version
      Node* Target;            // Target node

      void ParallelWalk( Node& x ) {
         if( x.is_leaf() ) {
             if( x.collides_with( *Target ) )
                 LocalHits.local().push_back(&x);
         } else {
             // Recurse on each child y of x in parallel
             oneapi::tbb::task_group g;
             for( Node::const_iterator y=x.begin(); y!=x.end(); ++y )
                 g.run( [=]{ParallelWalk(*y);} );
             // Wait for recursive calls to complete
             g.wait();
         }
      }

      void ParallelFindCollisions( Node& x ) {
         ParallelWalk(x);
         for( LocalList::iterator i=LocalHits.begin(); i!=LocalHits.end(); ++i )
             Hits.splice( Hits.end(), *i );
      }


   The recursive walk is parallelized using class ``task_group`` to run
   the recursive calls in parallel.


   The parallelism introduces another significant change. Because it
   would be unsafe to update ``Hits`` concurrently, the parallel walk
   uses the variable ``LocalHits`` to accumulate results. Because it is
   of type ``enumerable_thread_specific``, each thread accumulates its
   own private result. The results are spliced together into ``Hits``
   after the walk completes.
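
   As a side note, the same merge can be written with
   ``enumerable_thread_specific::combine_each``, which applies a functor
   to each thread-local element:


   ::


      // Equivalent to the splice loop in ParallelFindCollisions above.
      LocalHits.combine_each( [&]( std::list<Node*>& l ) {
         Hits.splice( Hits.end(), l );
      } );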


   The results will *not* be in the same order as those produced by the
   original serial code.


   If parallel overhead is high, use the Agglomeration pattern. For
   example, use the serial walk for subtrees under a certain threshold,
   as sketched below.
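
   The following sketch continues the declarations from the previous
   snippet. It assumes a hypothetical method ``Node::size()`` that
   reports the number of nodes in a subtree; the helper
   ``SerialWalkLocal`` and the cutoff value are likewise illustrative:


   ::


      // Serial walk that records into the calling thread's local list,
      // so it stays safe when invoked from inside the parallel walk.
      void SerialWalkLocal( Node& x ) {
         if( x.is_leaf() ) {
             if( x.collides_with( *Target ) )
                 LocalHits.local().push_back(&x);
         } else {
             for( Node::const_iterator y=x.begin(); y!=x.end(); ++y )
                 SerialWalkLocal(*y);
         }
      }

      const size_t Cutoff = 100;   // threshold, a tuning parameter

      void ParallelWalk( Node& x ) {
         if( x.size()<Cutoff ) {   // small subtree: avoid task overhead
             SerialWalkLocal( x );
         } else {
             oneapi::tbb::task_group g;
             for( Node::const_iterator y=x.begin(); y!=x.end(); ++y )
                 g.run( [=]{ParallelWalk(*y);} );
             g.wait();
         }
      }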


.. [1] Production-quality quicksort implementations typically
   use more sophisticated pivot selection, explicit stacks instead of
   recursion, and some other sorting algorithm for small subsorts. The
   simple algorithm is used here to focus on exposition of the parallel
   pattern.