1.. _Elementwise:
2
3Elementwise
4===========
5
6
7.. container:: section
8
9
10   .. rubric:: Problem
11      :class: sectiontitle
12
13   Initiate similar independent computations across items in a data set,
14   and wait until all complete.
15
16
17.. container:: section
18
19
20   .. rubric:: Context
21      :class: sectiontitle
22
23   Many serial algorithms sweep over a set of items and do an
24   independent computation on each item. However, if some kind of
25   summary information is collected, use the Reduction pattern instead.
26
27
28.. container:: section
29
30
31   .. rubric:: Forces
32      :class: sectiontitle
33
34   No information is carried or merged between the computations.
35
36
37.. container:: section
38
39
40   .. rubric:: Solution
41      :class: sectiontitle
42
43   If the number of items is known in advance, use
44   ``oneapi::tbb::parallel_for``. If not, consider using
45   ``oneapi::tbb::parallel_for_each``.
46
47
48   Use agglomeration if the individual computations are small relative
49   to scheduler overheads.
50
51
52   If the pattern is followed by a reduction on the same data, consider
53   doing the element-wise operation as part of the reduction, so that
54   the combination of the two patterns is accomplished in a single sweep
55   instead of two sweeps. Doing so may improve performance by reducing
56   traffic through the memory hierarchy.
57
58
59.. container:: section
60
61
62   .. rubric:: Example
63      :class: sectiontitle
64
65   Convolution is often used in signal processing. The convolution of a
66   filter ``c`` and signal ``x`` is computed as:
67
68
69   |image0|
70   Serial code for this computation might look like:
71
72
73   ::
74
75
76      // Assumes c[0..clen-1] and x[1-clen..xlen-1] are defined
77      for( int i=0; i<xlen+clen-1; ++i ) {
78         float tmp = 0;
79         for( int j=0; j<clen; ++j )
80             tmp += c[j]*x[i-j];
81         y[i] = tmp;
82      }
83
84
85   For simplicity, the fragment assumes that ``x`` is a pointer into an
86   array padded with zeros such that ``x[k]``\ returns zero when ``k<0``
87   or ``k≥xlen``.
88
89
90   The inner loop does not fit the elementwise pattern, because each
91   iteration depends on the previous iteration. However, the outer loop
92   fits the elementwise pattern. It is straightforward to render it
93   using ``oneapi::tbb::parallel_for`` as shown:
94
95
96   ::
97
98
99      oneapi::tbb::parallel_for( 0, xlen+clen-1, [=]( int i ) {
100         float tmp = 0;
101         for( int j=0; j<clen; ++j )
102             tmp += c[j]*x[i-j];
103         y[i] = tmp;
104      });
105
106
107   ``oneapi::tbb::parallel_for`` does automatic agglomeration by implicitly
108   using ``oneapi::tbb::auto_partitioner`` in its underlying implementation. If
109   there is reason to agglomerate explicitly, use the overload of
110   ``oneapi::tbb::parallel_for`` that takes an explicit range argument. The
111   following shows the example transformed to use the overload.
112
113
114   ::
115
116
117      oneapi::tbb::parallel_for(
118         oneapi::tbb::blocked_range<int>(0,xlen+clen-1,1000),
119         [=]( oneapi::tbb::blocked_range<int> r ) {
120               int end = r.end();
121             for( int i=r.begin(); i!=end; ++i ) {
122                 float tmp = 0;
123                 for( int j=0; j<clen; ++j )
124                     tmp += c[j]*x[i-j];
125                 y[i] = tmp;
126             }
127         }
128      );
129
130
131    
132
133
134
135.. |image0| image:: Images/image004a.jpg
136   :width: 99px
137   :height: 29px
138
139