1.. _Elementwise: 2 3Elementwise 4=========== 5 6 7.. container:: section 8 9 10 .. rubric:: Problem 11 :class: sectiontitle 12 13 Initiate similar independent computations across items in a data set, 14 and wait until all complete. 15 16 17.. container:: section 18 19 20 .. rubric:: Context 21 :class: sectiontitle 22 23 Many serial algorithms sweep over a set of items and do an 24 independent computation on each item. However, if some kind of 25 summary information is collected, use the Reduction pattern instead. 26 27 28.. container:: section 29 30 31 .. rubric:: Forces 32 :class: sectiontitle 33 34 No information is carried or merged between the computations. 35 36 37.. container:: section 38 39 40 .. rubric:: Solution 41 :class: sectiontitle 42 43 If the number of items is known in advance, use 44 ``oneapi::tbb::parallel_for``. If not, consider using 45 ``oneapi::tbb::parallel_for_each``. 46 47 48 Use agglomeration if the individual computations are small relative 49 to scheduler overheads. 50 51 52 If the pattern is followed by a reduction on the same data, consider 53 doing the element-wise operation as part of the reduction, so that 54 the combination of the two patterns is accomplished in a single sweep 55 instead of two sweeps. Doing so may improve performance by reducing 56 traffic through the memory hierarchy. 57 58 59.. container:: section 60 61 62 .. rubric:: Example 63 :class: sectiontitle 64 65 Convolution is often used in signal processing. The convolution of a 66 filter ``c`` and signal ``x`` is computed as: 67 68 69 |image0| 70 Serial code for this computation might look like: 71 72 73 :: 74 75 76 // Assumes c[0..clen-1] and x[1-clen..xlen-1] are defined 77 for( int i=0; i<xlen+clen-1; ++i ) { 78 float tmp = 0; 79 for( int j=0; j<clen; ++j ) 80 tmp += c[j]*x[i-j]; 81 y[i] = tmp; 82 } 83 84 85 For simplicity, the fragment assumes that ``x`` is a pointer into an 86 array padded with zeros such that ``x[k]``\ returns zero when ``k<0`` 87 or ``k≥xlen``. 88 89 90 The inner loop does not fit the elementwise pattern, because each 91 iteration depends on the previous iteration. However, the outer loop 92 fits the elementwise pattern. It is straightforward to render it 93 using ``oneapi::tbb::parallel_for`` as shown: 94 95 96 :: 97 98 99 oneapi::tbb::parallel_for( 0, xlen+clen-1, [=]( int i ) { 100 float tmp = 0; 101 for( int j=0; j<clen; ++j ) 102 tmp += c[j]*x[i-j]; 103 y[i] = tmp; 104 }); 105 106 107 ``oneapi::tbb::parallel_for`` does automatic agglomeration by implicitly 108 using ``oneapi::tbb::auto_partitioner`` in its underlying implementation. If 109 there is reason to agglomerate explicitly, use the overload of 110 ``oneapi::tbb::parallel_for`` that takes an explicit range argument. The 111 following shows the example transformed to use the overload. 112 113 114 :: 115 116 117 oneapi::tbb::parallel_for( 118 oneapi::tbb::blocked_range<int>(0,xlen+clen-1,1000), 119 [=]( oneapi::tbb::blocked_range<int> r ) { 120 int end = r.end(); 121 for( int i=r.begin(); i!=end; ++i ) { 122 float tmp = 0; 123 for( int j=0; j<clen; ++j ) 124 tmp += c[j]*x[i-j]; 125 y[i] = tmp; 126 } 127 } 128 ); 129 130 131 132 133 134 135.. |image0| image:: Images/image004a.jpg 136 :width: 99px 137 :height: 29px 138 139