.. _Controlling_Chunking:

Controlling Chunking
====================


Chunking is controlled by a *partitioner* and a *grainsize*. To gain
the most control over chunking, you specify both.


- Specify ``simple_partitioner()`` as the third argument to
  ``parallel_for``. Doing so turns off automatic chunking.

- Specify the grainsize when constructing the range. The three-argument
  form of the constructor is
  ``blocked_range<T>(begin,end,grainsize)``. The default value of
  ``grainsize`` is 1. It is in units of loop iterations per chunk.


If the chunks are too small, the overhead may exceed the performance
advantage.


The following code is the last example from ``parallel_for``, modified to
use an explicit grainsize ``G``.


::


   #include "oneapi/tbb.h"

   using namespace oneapi::tbb;

   void ParallelApplyFoo( float a[], size_t n ) {
       // G is the explicit grainsize; ApplyFoo is the body class
       // from the earlier parallel_for example.
       parallel_for(blocked_range<size_t>(0,n,G), ApplyFoo(a),
                    simple_partitioner());
   }


The grainsize sets a minimum threshold for parallelization. The
``parallel_for`` in the example invokes ``ApplyFoo::operator()`` on
chunks, possibly of different sizes. Let *chunksize* be the number of
iterations in a chunk. Using ``simple_partitioner`` guarantees that
⌈G/2⌉ <= *chunksize* <= G.


There is also an intermediate level of control where you specify the
grainsize for the range, but use an ``auto_partitioner`` or
``affinity_partitioner``. An ``auto_partitioner`` is the default
partitioner. Both partitioners implement the automatic grainsize
heuristic described in :ref:`Automatic_Chunking`. An
``affinity_partitioner`` implies an additional hint, as explained later
in Section :ref:`Bandwidth_and_Cache_Affinity`. Though these partitioners
may cause chunks to have more than G iterations, they never generate
chunks with fewer than ⌈G/2⌉ iterations.
Specifying a range with an explicit grainsize may occasionally be useful
to prevent these partitioners from generating wastefully small chunks if
their heuristics fail.


Because of the impact of grainsize on parallel loops, it is worth
reading the following material even if you rely on ``auto_partitioner``
and ``affinity_partitioner`` to choose the grainsize automatically.


.. container:: tablenoborder


   .. list-table::
      :header-rows: 1

      * - |image0|
        - |image1|
      * - Case A
        - Case B


The above figure illustrates the impact of grainsize by showing the
useful work as the gray area inside a brown border that represents
overhead. Both Case A and Case B have the same total gray area. Case A
shows how too small a grainsize leads to a relatively high proportion of
overhead. Case B shows how a large grainsize reduces this proportion, at
the cost of reducing potential parallelism. The overhead as a fraction
of useful work depends upon the grainsize, not on the number of grains.
Consider this relationship, and not the total number of iterations or
number of processors, when setting a grainsize.


A rule of thumb is that ``grainsize`` iterations of ``operator()``
should take at least 100,000 clock cycles to execute. For example, if a
single iteration takes 100 clocks, then the ``grainsize`` needs to be at
least 1000 iterations. When in doubt, do the following experiment:


#. Set the ``grainsize`` parameter higher than necessary. The grainsize
   is specified in units of loop iterations. If you have no idea of how
   many clock cycles an iteration might take, start with
   ``grainsize``\ =100,000. The rationale is that each iteration
   normally requires at least one clock. In most cases, step 3 will
   guide you to a much smaller value.

#. Run your algorithm.

#.
   Iteratively halve the ``grainsize`` parameter and see how much the
   algorithm slows down or speeds up as the value decreases.


A drawback of setting a grainsize too high is that it can reduce
parallelism. For example, if the grainsize is 1000 and the loop has 2000
iterations, the ``parallel_for`` distributes the loop across only two
processors, even if more are available. However, if you are unsure, err
on the side of being a little too high instead of a little too low,
because too low a value hurts serial performance, which in turn hurts
parallel performance if there is other parallelism available higher up
in the call tree.


.. tip::
   You do not have to set the grainsize too precisely.


The next figure shows the typical "bathtub curve" for execution time
versus grainsize, based on the floating point ``a[i]=b[i]*c``
computation over a million indices. There is little work per iteration.
The times were collected on a four-socket machine with eight hardware
threads.


.. container:: fignone
   :name: fig2


   Wall Clock Time Versus Grainsize
   |image2|


The scale is logarithmic. The downward slope on the left side indicates
that with a grainsize of one, most of the time is spent on parallel
scheduling overhead, not useful work. An increase in grainsize brings a
proportional decrease in parallel overhead. Then the curve flattens out
because the parallel overhead becomes insignificant for a sufficiently
large grainsize. At the right end, the curve turns up because the
chunks are so large that there are fewer chunks than available hardware
threads. Notice that a grainsize over the wide range 100-100,000 works
quite well.


.. tip::
   A general rule of thumb for parallelizing loop nests is to
   parallelize the outermost one possible.
   The reason is that each
   iteration of an outer loop is likely to provide a bigger grain of
   work than an iteration of an inner loop.


.. |image0| image:: Images/image002.jpg
   :width: 161px
   :height: 163px
.. |image1| image:: Images/image004.jpg
   :width: 157px
   :height: 144px
.. |image2| image:: Images/image006.jpg
   :width: 462px
   :height: 193px