1.. _Wavefront:
2
3Wavefront
4=========
5
6
7.. container:: section
8
9
10   .. rubric:: Problem
11      :class: sectiontitle
12
13   Perform computations on items in a data set, where the computation on
14   an item uses results from computations on predecessor items.
15
16
17.. container:: section
18
19
20   .. rubric:: Context
21      :class: sectiontitle
22
23   The dependences between computations form an acyclic graph.
24
25
26.. container:: section
27
28
29   .. rubric:: Forces
30      :class: sectiontitle
31
32   -  Dependence constraints between items form an acyclic graph.
33
34
35   -  The number of immediate predecessors in the graph is known in
36      advance, or can be determined some time before the last
37      predecessor completes.
38
39
40.. container:: section
41
42
43   .. rubric:: Solution
44      :class: sectiontitle
45
46   The solution is a parallel variant of topological sorting, using
47   ``oneapi::tbb::parallel_for_each`` to process items. Associate an atomic
48   counter with each item. Initialize each counter to the number of
49   predecessors. Invoke ``oneapi::tbb::parallel_for_each`` to process the items that
50   have no predessors (have counts of zero). After an item is processed,
51   decrement the counters of its successors. If a successor's counter
52   reaches zero, add that successor to the ``oneapi::tbb::parallel_for_each``
53   via a "feeder".
54
55
56   If the number of predecessors for an item cannot be determined in
57   advance, treat the information "know number of predecessors" as an
58   additional predecessor. When the number of predecessors becomes
59   known, treat this conceptual predecessor as completed.
60
61
62   If the overhead of counting individual items is excessive, aggregate
63   items into blocks, and do the wavefront over the blocks.
64
65
66.. container:: section
67
68
69   .. rubric:: Example
70      :class: sectiontitle
71
72   Below is a serial kernel for the longest common subsequence
73   algorithm. The parameters are strings ``x`` and ``y`` with respective
74   lengths ``xlen`` and ``ylen``.
75
76
77   ::
78
79
80      int F[MAX_LEN+1][MAX_LEN+1];
81
82
83      void SerialLCS( const char* x, size_t xlen, const char* y, size_t ylen )
84      {
85         for( size_t i=1; i<=xlen; ++i )
86             for( size_t j=1; j<=ylen; ++j )
87                 F[i][j] = x[i-1]==y[j-1] ? F[i-1][j-1]+1:
88                                            max(F[i][j-1],F[i-1][j]);
89      }
90
91
92   The kernel sets ``F[i][j]`` to the length of the longest common
93   subsequence shared by ``x[0..i-1]`` and ``y[0..j-1]``. It assumes
94   that F[0][0..ylen] and ``F[0..xlen][0]`` have already been
95   initialized to zero.
96
97
98   The following figure shows the data dependences for calculating
99   ``F[i][j]``.
100
101
102   .. container:: fignone
103      :name: fig3
104
105
106      Data dependences for longest common substring calculation.
107      |image0|
108
109
110   The following figure shows the gray diagonal dependence is the
111   transitive closure of other dependencies. Thus for parallelization
112   purposes it is a redundant dependence that can be ignored.
113
114
115   .. container:: fignone
116      :name: fig4
117
118
119      Diagonal dependence is redundant.
120      |image1|
121
122
123   It is generally good to remove redundant dependences from
124   consideration, because the atomic counting incurs a cost for each
125   dependence considered.
126
127
128   Another consideration is grain size. Scheduling each ``F[i][j]``
129   element calculation separately is prohibitively expensive. A good
130   solution is to aggregate the elements into contiguous blocks, and
131   process the contents of a block serially. The blocks have the same
132   dependence pattern, but at a block scale. Hence scheduling overheads
133   can be amortized over blocks.
134
135
136   The parallel code follows. Each block consists of ``N×N`` elements.
137   Each block has an associated atomic counter. Array ``Count``
138   organizes these counters for easy lookup. The code initializes the
139   counters and then rolls a wavefront using ``parallel_for_each``,
140   starting with the block at the origin since it has no predecessors.
141
142
143   ::
144
145
146      const int N = 64;
147      std::atomic<char> Count[MAX_LEN/N+1][MAX_LEN/N+1];
148       
149
150      void ParallelLCS( const char* x, size_t xlen, const char* y, size_t ylen ) {
151         // Initialize predecessor counts for blocks.
152         size_t m = (xlen+N-1)/N;
153         size_t n = (ylen+N-1)/N;
154         for( int i=0; i<m; ++i )
155             for( int j=0; j<n; ++j )
156                 Count[i][j] = (i>0)+(j>0);
157         // Roll the wavefront from the origin.
158         typedef pair<size_t,size_t> block;
159         block origin(0,0);
160         oneapi::tbb::parallel_for_each( &origin, &origin+1,
161             [=]( const block& b, oneapi::tbb::feeder<block>&feeder ) {
162                 // Extract bounds on block
163                 size_t bi = b.first;
164                 size_t bj = b.second;
165                 size_t xl = N*bi+1;
166                 size_t xu = min(xl+N,xlen+1);
167                 size_t yl = N*bj+1;
168                 size_t yu = min(yl+N,ylen+1);
169                 // Process the block
170                 for( size_t i=xl; i<xu; ++i )
171                     for( size_t j=yl; j<yu; ++j )
172                         F[i][j] = x[i-1]==y[j-1] ? F[i-1][j-1]+1:
173                                                    max(F[i][j-1],F[i-1][j]);
174                 // Account for successors
175                 if( bj+1<n && --Count[bi][bj+1]==0 )
176                     feeder.add( block(bi,bj+1) );
177                 if( bi+1<m && --Count[bi+1][bj]==0 )
178                     feeder.add( block(bi+1,bj) );       }
179         );
180      }
181
182
183.. container:: section
184
185
186   .. rubric:: References
187      :class: sectiontitle
188
189   Eun-Gyu Kim and Mark Snir, "Wavefront Pattern",
190   http://snir.cs.illinois.edu/patterns/wavefront.pdf
191
192
193.. |image0| image:: Images/image005a.jpg
194   :width: 122px
195   :height: 122px
196.. |image1| image:: Images/image006a.jpg
197   :width: 122px
198   :height: 122px
199
200