.. _Wavefront:

Wavefront
=========


.. container:: section


   .. rubric:: Problem
      :class: sectiontitle

   Perform computations on items in a data set, where the computation on
   an item uses results from computations on predecessor items.


.. container:: section


   .. rubric:: Context
      :class: sectiontitle

   The dependences between computations form an acyclic graph.


.. container:: section


   .. rubric:: Forces
      :class: sectiontitle

   - Dependence constraints between items form an acyclic graph.

   - The number of immediate predecessors in the graph is known in
     advance, or can be determined some time before the last
     predecessor completes.


.. container:: section


   .. rubric:: Solution
      :class: sectiontitle

   The solution is a parallel variant of topological sorting, using
   ``oneapi::tbb::parallel_for_each`` to process items. Associate an atomic
   counter with each item. Initialize each counter to the number of
   predecessors. Invoke ``oneapi::tbb::parallel_for_each`` to process the
   items that have no predecessors (have counts of zero). After an item is
   processed, decrement the counters of its successors. If a successor's
   counter reaches zero, add that successor to the
   ``oneapi::tbb::parallel_for_each`` via a "feeder".

   If the number of predecessors for an item cannot be determined in
   advance, treat the information "known number of predecessors" as an
   additional predecessor. When the number of predecessors becomes
   known, treat this conceptual predecessor as completed.

   If the overhead of counting individual items is excessive, aggregate
   items into blocks, and do the wavefront over the blocks.


.. container:: section


   .. rubric:: Example
      :class: sectiontitle

   Below is a serial kernel for the longest common subsequence
   algorithm.
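   It computes the standard LCS recurrence, which in the notation of the
   code (a restatement for reference, not part of the original example) is:

```latex
F[i][j] =
\begin{cases}
F[i-1][j-1] + 1, & \text{if } x[i-1] = y[j-1],\\[2pt]
\max\bigl(F[i][j-1],\; F[i-1][j]\bigr), & \text{otherwise.}
\end{cases}
```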
   The parameters are strings ``x`` and ``y`` with respective
   lengths ``xlen`` and ``ylen``.

   ::


      int F[MAX_LEN+1][MAX_LEN+1];

      void SerialLCS( const char* x, size_t xlen, const char* y, size_t ylen )
      {
          for( size_t i=1; i<=xlen; ++i )
              for( size_t j=1; j<=ylen; ++j )
                  F[i][j] = x[i-1]==y[j-1] ? F[i-1][j-1]+1 :
                            std::max(F[i][j-1],F[i-1][j]);
      }


   The kernel sets ``F[i][j]`` to the length of the longest common
   subsequence shared by ``x[0..i-1]`` and ``y[0..j-1]``. It assumes
   that ``F[0][0..ylen]`` and ``F[0..xlen][0]`` have already been
   initialized to zero.

   The following figure shows the data dependences for calculating
   ``F[i][j]``.

   .. container:: fignone
      :name: fig3

      Data dependences for the longest common subsequence calculation.
      |image0|

   The following figure shows that the gray diagonal dependence is in
   the transitive closure of the other dependences. Thus, for
   parallelization purposes, it is a redundant dependence that can be
   ignored.

   .. container:: fignone
      :name: fig4

      Diagonal dependence is redundant.
      |image1|

   It is generally good to remove redundant dependences from
   consideration, because the atomic counting incurs a cost for each
   dependence considered.

   Another consideration is grain size. Scheduling each ``F[i][j]``
   element calculation separately is prohibitively expensive. A good
   solution is to aggregate the elements into contiguous blocks and
   process the contents of a block serially. The blocks have the same
   dependence pattern, but at a block scale, so scheduling overheads
   can be amortized over blocks.

   The parallel code follows. Each block consists of ``N×N`` elements.
   Each block has an associated atomic counter. Array ``Count``
   organizes these counters for easy lookup.
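   Before showing the full parallel version, the counter protocol can be
   sketched in isolation. In the following sketch a plain ``std::deque``
   stands in for ``oneapi::tbb::feeder`` so that it runs sequentially
   without TBB; the grid size ``W`` and the ``wavefront_order`` helper are
   illustrative names, not part of the example or the library.

```cpp
#include <atomic>
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

// Sequential sketch of the wavefront counter protocol on a small W x W grid
// of blocks. A std::deque stands in for oneapi::tbb::feeder, so this runs
// without TBB; W and wavefront_order are illustrative names, not library API.
constexpr std::size_t W = 4;

std::vector<std::pair<std::size_t, std::size_t>> wavefront_order() {
    std::atomic<int> count[W][W];
    for (std::size_t i = 0; i < W; ++i)
        for (std::size_t j = 0; j < W; ++j)
            count[i][j] = (i > 0) + (j > 0);      // number of predecessors

    std::vector<std::pair<std::size_t, std::size_t>> order;
    std::deque<std::pair<std::size_t, std::size_t>> ready;
    ready.emplace_back(0, 0);                     // origin has no predecessors
    while (!ready.empty()) {
        auto [i, j] = ready.front();
        ready.pop_front();
        order.emplace_back(i, j);                 // "process" block (i,j)
        // Decrement each successor's counter; enqueue it when it hits zero.
        if (j + 1 < W && --count[i][j + 1] == 0) ready.emplace_back(i, j + 1);
        if (i + 1 < W && --count[i + 1][j] == 0) ready.emplace_back(i + 1, j);
    }
    return order;                                 // a valid topological order
}
```

   Every block is enqueued exactly once, and only after both its left and
   upper neighbors have been processed; this is the invariant the parallel
   version below relies on when blocks are processed concurrently.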
   The code initializes the counters and then rolls a wavefront using
   ``parallel_for_each``, starting with the block at the origin, since it
   has no predecessors.

   ::


      #include <oneapi/tbb/parallel_for_each.h>
      #include <algorithm>
      #include <atomic>
      #include <utility>

      const int N = 64;
      std::atomic<char> Count[MAX_LEN/N+1][MAX_LEN/N+1];

      void ParallelLCS( const char* x, size_t xlen, const char* y, size_t ylen ) {
          // Initialize predecessor counts for blocks.
          size_t m = (xlen+N-1)/N;
          size_t n = (ylen+N-1)/N;
          for( size_t i=0; i<m; ++i )
              for( size_t j=0; j<n; ++j )
                  Count[i][j] = (i>0)+(j>0);
          // Roll the wavefront from the origin.
          typedef std::pair<size_t,size_t> block;
          block origin(0,0);
          oneapi::tbb::parallel_for_each( &origin, &origin+1,
              [=]( const block& b, oneapi::tbb::feeder<block>& feeder ) {
                  // Extract bounds on the block.
                  size_t bi = b.first;
                  size_t bj = b.second;
                  size_t xl = N*bi+1;
                  size_t xu = std::min(xl+N,xlen+1);
                  size_t yl = N*bj+1;
                  size_t yu = std::min(yl+N,ylen+1);
                  // Process the block.
                  for( size_t i=xl; i<xu; ++i )
                      for( size_t j=yl; j<yu; ++j )
                          F[i][j] = x[i-1]==y[j-1] ? F[i-1][j-1]+1 :
                                    std::max(F[i][j-1],F[i-1][j]);
                  // Account for successors.
                  if( bj+1<n && --Count[bi][bj+1]==0 )
                      feeder.add( block(bi,bj+1) );
                  if( bi+1<m && --Count[bi+1][bj]==0 )
                      feeder.add( block(bi+1,bj) );
              }
          );
      }


.. container:: section


   .. rubric:: References
      :class: sectiontitle

   Eun-Gyu Kim and Mark Snir, "Wavefront Pattern",
   http://snir.cs.illinois.edu/patterns/wavefront.pdf


.. |image0| image:: Images/image005a.jpg
   :width: 122px
   :height: 122px
.. |image1| image:: Images/image006a.jpg
   :width: 122px
   :height: 122px