.. _Controlling_Chunking:

Controlling Chunking
====================


Chunking is controlled by a *partitioner* and a *grainsize*. To gain
the most control over chunking, you specify both.


-  Specify ``simple_partitioner()`` as the third argument to
   ``parallel_for``. Doing so turns off automatic chunking.


-  Specify the grainsize when constructing the range. The
   three-argument form of the constructor is
   ``blocked_range<T>(begin,end,grainsize)``. The default value of
   ``grainsize`` is 1. It is in units of loop iterations per chunk.


If the chunks are too small, the overhead may exceed the performance
advantage.


The following code is the last example from ``parallel_for``, modified to
use an explicit grainsize ``G``.


::

   #include "oneapi/tbb.h"

   using namespace oneapi::tbb;

   void ParallelApplyFoo( float a[], size_t n ) {
       parallel_for(blocked_range<size_t>(0,n,G), ApplyFoo(a),
                    simple_partitioner());
   }


The grainsize sets a minimum threshold for parallelization. The
``parallel_for`` in the example invokes ``ApplyFoo::operator()`` on
chunks, possibly of different sizes. Let *chunksize* be the number of
iterations in a chunk. Using ``simple_partitioner`` guarantees that
⌈G/2⌉ ≤ *chunksize* ≤ G.
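
To see these bounds in practice, the following sketch prints the chunk
sizes that ``simple_partitioner`` actually produces. The problem size and
the grainsize of 64 are arbitrary values chosen for illustration.

::

   #include <cstdio>
   #include "oneapi/tbb.h"

   using namespace oneapi::tbb;

   int main() {
       const size_t N = 1000;   // arbitrary problem size
       const size_t G = 64;     // arbitrary grainsize for the experiment
       // Each invocation of the body receives one chunk; printing r.size()
       // shows that every chunk has between ceil(G/2) and G iterations.
       parallel_for(blocked_range<size_t>(0, N, G),
                    [](const blocked_range<size_t>& r) {
                        std::printf("chunk of %zu iterations\n", r.size());
                    },
                    simple_partitioner());
       return 0;
   }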


There is also an intermediate level of control where you specify the
grainsize for the range but use an ``auto_partitioner`` or an
``affinity_partitioner``. An ``auto_partitioner`` is the default
partitioner. Both partitioners implement the automatic grainsize
heuristic described in :ref:`Automatic_Chunking`. An
``affinity_partitioner`` implies an additional hint, as explained later
in :ref:`Bandwidth_and_Cache_Affinity`. Though these partitioners
may cause chunks to have more than G iterations, they never generate
chunks with fewer than ⌈G/2⌉ iterations. Specifying a range with an
explicit grainsize may occasionally be useful to prevent these
partitioners from generating wastefully small chunks if their heuristics
fail.
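
For example, a minimal sketch of this intermediate level of control,
reusing ``ApplyFoo`` and the grainsize ``G`` from the example above,
passes an ``auto_partitioner`` instead of a ``simple_partitioner``:

::

   void ParallelApplyFooAuto( float a[], size_t n ) {
       // auto_partitioner may merge work into chunks larger than G
       // iterations, but never splits below ceil(G/2) iterations.
       parallel_for(blocked_range<size_t>(0, n, G), ApplyFoo(a),
                    auto_partitioner());
   }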


Because of the impact of grainsize on parallel loops, it is worth
reading the following material even if you rely on ``auto_partitioner``
and ``affinity_partitioner`` to choose the grainsize automatically.


.. container:: tablenoborder


   .. list-table::
      :header-rows: 1

      * -     |image0|
        -     |image1|
      * -     Case A
        -     Case B



The above figure illustrates the impact of grainsize by showing the
useful work as the gray area inside a brown border that represents
overhead. Both Case A and Case B have the same total gray area. Case A
shows how too small a grainsize leads to a relatively high proportion of
overhead. Case B shows how a large grainsize reduces this proportion, at
the cost of reducing potential parallelism. The overhead as a fraction
of useful work depends upon the grainsize, not on the number of grains.
Consider this relationship, and not the total number of iterations or
number of processors, when setting a grainsize.


A rule of thumb is that ``grainsize`` iterations of ``operator()``
should take at least 100,000 clock cycles to execute. For example, if a
single iteration takes 100 clocks, then the ``grainsize`` needs to be at
least 1000 iterations. When in doubt, do the following experiment:


#. Set the ``grainsize`` parameter higher than necessary. The grainsize
   is specified in units of loop iterations. If you have no idea of how
   many clock cycles an iteration might take, start with
   ``grainsize``\ =100,000. The rationale is that each iteration
   normally requires at least one clock. In most cases, step 3 will
   guide you to a much smaller value.


#. Run your algorithm.


#. Iteratively halve the ``grainsize`` parameter and see how much the
   algorithm slows down or speeds up as the value decreases. A sketch of
   this experiment appears after this list.


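The following is a minimal sketch of that experiment. The arrays ``a``
and ``b``, the scalar ``c``, and the function name are hypothetical; the
loop body matches the ``a[i]=b[i]*c`` computation used for the timing
figure below.

::

   #include <cstdio>
   #include "oneapi/tbb.h"

   using namespace oneapi::tbb;

   // Hypothetical sketch: time the loop with a generous starting grainsize,
   // then keep halving it and compare the wall-clock times.
   void GrainsizeExperiment( float a[], float b[], float c, size_t n ) {
       for (size_t g = 100000; g >= 1; g /= 2) {
           tick_count t0 = tick_count::now();
           parallel_for(blocked_range<size_t>(0, n, g),
                        [=](const blocked_range<size_t>& r) {
                            for (size_t i = r.begin(); i != r.end(); ++i)
                                a[i] = b[i] * c;
                        },
                        simple_partitioner());
           tick_count t1 = tick_count::now();
           std::printf("grainsize=%zu time=%g s\n", g, (t1 - t0).seconds());
       }
   }
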
A drawback of setting a grainsize too high is that it can reduce
parallelism. For example, if the grainsize is 1000 and the loop has 2000
iterations, the ``parallel_for`` distributes the loop across only two
processors, even if more are available. However, if you are unsure, err
on the side of being a little too high instead of a little too low,
because too low a value hurts serial performance, which in turn hurts
parallel performance if there is other parallelism available higher up
in the call tree.


.. tip::
   You do not have to set the grainsize too precisely.


The next figure shows the typical "bathtub curve" for execution time
versus grainsize, based on the floating point ``a[i]=b[i]*c``
computation over a million indices. There is little work per iteration.
The times were collected on a four-socket machine with eight hardware
threads.


.. container:: fignone
   :name: fig2


   Wall Clock Time Versus Grainsize
   |image2|


The scale is logarithmic. The downward slope on the left side indicates
that with a grainsize of one, most of the overhead is parallel
scheduling overhead, not useful work. An increase in grainsize brings a
proportional decrease in parallel overhead. Then the curve flattens out
because the parallel overhead becomes insignificant for a sufficiently
large grainsize. At the far right, the curve turns up because the
chunks are so large that there are fewer chunks than available hardware
threads. Notice that a grainsize over the wide range 100-100,000 works
quite well.


.. tip::
   A general rule of thumb for parallelizing loop nests is to
   parallelize the outermost one possible. The reason is that each
   iteration of an outer loop is likely to provide a bigger grain of
   work than an iteration of an inner loop.

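For instance, in the hypothetical sketch below, parallelizing the outer
loop over the rows of a matrix gives each task an entire row of work, a
much bigger grain than a single element of the inner loop:

::

   #include "oneapi/tbb.h"

   using namespace oneapi::tbb;

   // Scale an m-by-n matrix stored row-major in a. The outer (row) loop is
   // parallelized; the inner (column) loop stays serial inside each task.
   void ScaleMatrix( float a[], size_t m, size_t n, float c ) {
       parallel_for(blocked_range<size_t>(0, m),
                    [=](const blocked_range<size_t>& r) {
                        for (size_t i = r.begin(); i != r.end(); ++i)
                            for (size_t j = 0; j != n; ++j)
                                a[i*n + j] *= c;
                    });
   }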


.. |image0| image:: Images/image002.jpg
   :width: 161px
   :height: 163px
.. |image1| image:: Images/image004.jpg
   :width: 157px
   :height: 144px
.. |image2| image:: Images/image006.jpg
   :width: 462px
   :height: 193px
