.. _Divide_and_Conquer:

Divide and Conquer
==================

.. container:: section

   .. rubric:: Problem
      :class: sectiontitle

   Parallelize a divide and conquer algorithm.

.. container:: section

   .. rubric:: Context
      :class: sectiontitle

   Divide and conquer is widely used in serial algorithms. Common
   examples are quicksort and mergesort.

.. container:: section

   .. rubric:: Forces
      :class: sectiontitle

   - The problem can be transformed into subproblems that can be
     solved independently.

   - Splitting the problem or merging the solutions is relatively
     cheap compared to the cost of solving the subproblems.

.. container:: section

   .. rubric:: Solution
      :class: sectiontitle

   There are several ways to implement divide and conquer in
   |full_name|. The best choice depends upon circumstances:

   - If division always yields the same number of subproblems, use
     recursion and ``oneapi::tbb::parallel_invoke``.

   - If the number of subproblems varies, use recursion and
     ``oneapi::tbb::task_group``.

.. container:: section

   .. rubric:: Example
      :class: sectiontitle

   Quicksort is a classic divide-and-conquer algorithm. It divides a
   sorting problem into two subsorts. A simple serial version looks
   like this: [1]_

   ::

      void SerialQuicksort( T* begin, T* end ) {
          if( end-begin>1 ) {
              using namespace std;
              // Partition around the first element, which serves as the pivot.
              // (std::bind2nd was removed in C++17; a lambda expresses the
              // same predicate.)
              T* mid = partition( begin+1, end,
                                  [=](const T& x) { return x<*begin; } );
              swap( *begin, mid[-1] );
              SerialQuicksort( begin, mid-1 );
              SerialQuicksort( mid, end );
          }
      }

   The number of subsorts is fixed at two, so ``oneapi::tbb::parallel_invoke``
   provides a simple way to parallelize it.
   The parallel code is shown below:

   ::

      void ParallelQuicksort( T* begin, T* end ) {
          if( end-begin>1 ) {
              using namespace std;
              T* mid = partition( begin+1, end,
                                  [=](const T& x) { return x<*begin; } );
              swap( *begin, mid[-1] );
              // Run the two subsorts as parallel tasks.
              oneapi::tbb::parallel_invoke(
                  [=]{ParallelQuicksort( begin, mid-1 );},
                  [=]{ParallelQuicksort( mid, end );} );
          }
      }

   Eventually the subsorts become small enough that serial execution
   is more efficient. The following variation sorts subranges of
   fewer than 500 elements with the earlier serial code:

   ::

      void ParallelQuicksort( T* begin, T* end ) {
          if( end-begin>=500 ) {
              using namespace std;
              T* mid = partition( begin+1, end,
                                  [=](const T& x) { return x<*begin; } );
              swap( *begin, mid[-1] );
              oneapi::tbb::parallel_invoke(
                  [=]{ParallelQuicksort( begin, mid-1 );},
                  [=]{ParallelQuicksort( mid, end );} );
          } else {
              SerialQuicksort( begin, end );
          }
      }

   The change is an instance of the Agglomeration pattern.

   The next example considers a problem with a variable number of
   subproblems. The problem involves a tree-like description of a
   mechanical assembly. There are two kinds of nodes:

   - Leaf nodes represent individual parts.

   - Internal nodes represent groups of parts.

   The problem is to find all nodes that collide with a target node.
   The following code shows a serial solution that walks the tree. It
   records in ``Hits`` any nodes that collide with ``Target``:

   ::

      std::list<Node*> Hits;
      Node* Target;

      void SerialFindCollisions( Node& x ) {
          if( x.is_leaf() ) {
              if( x.collides_with( *Target ) )
                  Hits.push_back(&x);
          } else {
              for( Node::const_iterator y=x.begin(); y!=x.end(); ++y )
                  SerialFindCollisions(*y);
          }
      }

   A parallel version is shown below.
   ::

      typedef oneapi::tbb::enumerable_thread_specific<std::list<Node*> > LocalList;
      LocalList LocalHits;
      Node* Target;   // Target node

      void ParallelWalk( Node& x ) {
          if( x.is_leaf() ) {
              if( x.collides_with( *Target ) )
                  LocalHits.local().push_back(&x);
          } else {
              // Recurse on each child y of x in parallel.
              oneapi::tbb::task_group g;
              for( Node::const_iterator y=x.begin(); y!=x.end(); ++y )
                  g.run( [=]{ParallelWalk(*y);} );
              // Wait for the recursive calls to complete.
              g.wait();
          }
      }

      void ParallelFindCollisions( Node& x ) {
          ParallelWalk(x);
          for( LocalList::iterator i=LocalHits.begin(); i!=LocalHits.end(); ++i )
              Hits.splice( Hits.end(), *i );
      }

   The recursive walk is parallelized with class ``task_group``, which
   runs the recursive calls as independent tasks.

   The parallelism introduces another significant change: because it
   would be unsafe to update ``Hits`` concurrently, the parallel walk
   accumulates results in the variable ``LocalHits``. Because it is of
   type ``enumerable_thread_specific``, each thread accumulates its own
   private result. The per-thread lists are spliced together into
   ``Hits`` after the walk completes.

   Note that the results will *not* be in the same order as in the
   original serial code.

   If parallel overhead is high, use the Agglomeration pattern. For
   example, use the serial walk for subtrees below a certain size
   threshold.

.. [1] Production-quality quicksort implementations typically use more
   sophisticated pivot selection, explicit stacks instead of recursion,
   and some other sorting algorithm for small subsorts. The simple
   algorithm is used here to focus on exposition of the parallel
   pattern.
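The quicksort examples above depend on TBB, but the divide-and-conquer structure with an agglomeration cutoff is itself library-neutral. As a rough, self-contained sketch that can be tried without TBB installed, the following program expresses the same shape using only the C++ standard library; the name ``HybridQuicksort``, the use of ``std::async`` in place of ``oneapi::tbb::parallel_invoke``, and the serial base case are illustrative choices, not part of the examples above:

::

   #include <algorithm>
   #include <future>
   #include <iostream>
   #include <random>
   #include <vector>

   // Divide-and-conquer quicksort with an agglomeration cutoff.
   // std::async stands in for oneapi::tbb::parallel_invoke: one
   // subsort runs in another thread, the other in the current one.
   void HybridQuicksort( int* begin, int* end ) {
       if( end-begin>=500 ) {
           int* mid = std::partition( begin+1, end,
                                      [=](int x) { return x < *begin; } );
           std::swap( *begin, mid[-1] );
           // Launch one subsort asynchronously, recurse on the other here.
           auto left = std::async( std::launch::async,
                                   HybridQuicksort, begin, mid-1 );
           HybridQuicksort( mid, end );
           left.wait();   // join before returning (cf. task_group::wait)
       } else {
           std::sort( begin, end );   // serial base case below the cutoff
       }
   }

   int main() {
       // Sort 10,000 pseudo-random integers and verify the result.
       std::vector<int> v(10000);
       std::mt19937 gen(42);
       std::uniform_int_distribution<int> d(0, 99999);
       for( int& x : v ) x = d(gen);
       HybridQuicksort( v.data(), v.data()+v.size() );
       std::cout << ( std::is_sorted(v.begin(), v.end()) ? "sorted"
                                                         : "unsorted" )
                 << "\n";
       return 0;
   }

Joining the future before returning mirrors ``task_group::wait``: the function does not complete until both subsorts have finished, so callers may safely read the range afterward.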