.. _Bandwidth_and_Cache_Affinity:

Bandwidth and Cache Affinity
============================


For a sufficiently simple function ``Foo``, the examples might not show
good speedup when written as parallel loops. The cause could be
insufficient system bandwidth between the processors and memory. In that
case, you may have to rethink your algorithm to take better advantage of
cache. Restructuring to better utilize the cache usually benefits the
parallel program as well as the serial program.


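As a hedged illustration (the functions here are invented for this
guide, not part of the library or the examples above), the following
sketch shows one common restructuring: fusing two passes over the same
array into one, so each element crosses the memory bus once instead of
twice.


::


   // Hypothetical example: two passes each stream the whole array
   // through cache, so the memory traffic is paid twice.
   void ScaleThenBias( float a[], size_t n, float s, float b ) {
       for( size_t i=0; i<n; ++i ) a[i] *= s;   // pass 1
       for( size_t i=0; i<n; ++i ) a[i] += b;   // pass 2
   }

   // Fused version: one pass touches each element once, roughly
   // halving the memory traffic for the same arithmetic.
   void ScaleThenBiasFused( float a[], size_t n, float s, float b ) {
       for( size_t i=0; i<n; ++i ) a[i] = a[i]*s + b;
   }

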
An alternative to restructuring that works in some cases is
``affinity_partitioner``. It not only automatically chooses the
grainsize, but also optimizes for cache affinity and tries to distribute
the data uniformly among threads. Using ``affinity_partitioner`` can
significantly improve performance when:


-  The computation does a few operations per data access.


-  The data acted upon by the loop fits in cache.


-  The loop, or a similar loop, is re-executed over the same data.


-  There are more than two hardware threads available (and especially if
   the number of threads is not a power of two). If only two threads are
   available, the default scheduling in |full_name|
   usually provides sufficient cache affinity.


The following code shows how to use ``affinity_partitioner``.


::


   #include "oneapi/tbb.h"

   using namespace oneapi::tbb;

   void ParallelApplyFoo( float a[], size_t n ) {
       // The partitioner is static so it persists across calls and can
       // replay its record of which thread ran each chunk last time.
       static affinity_partitioner ap;
       // ApplyFoo is the loop body class from the earlier examples.
       parallel_for(blocked_range<size_t>(0,n), ApplyFoo(a), ap);
   }

   void TimeStepFoo( float a[], size_t n, int steps ) {
       for( int t=0; t<steps; ++t )
           ParallelApplyFoo( a, n );
   }


In the example, the ``affinity_partitioner`` object ``ap`` persists
between invocations of the parallel loop. It remembers where iterations
of the loop ran, so that each iteration can be handed to the same thread
that executed it before. The example code gets the lifetime of the
partitioner right by declaring the ``affinity_partitioner`` as a local
static object. Another approach would be to declare it at a scope
outside the iterative loop in ``TimeStepFoo``, and hand it down the call
chain to ``parallel_for``.
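

A minimal sketch of that second approach (hypothetical signatures, not
the form shown above) might look like:


::


   // Hypothetical variant: the partitioner is owned by TimeStepFoo and
   // passed down the call chain instead of being a local static.
   void ParallelApplyFoo( float a[], size_t n, affinity_partitioner& ap ) {
       parallel_for(blocked_range<size_t>(0,n), ApplyFoo(a), ap);
   }

   void TimeStepFoo( float a[], size_t n, int steps ) {
       affinity_partitioner ap;   // lives for the whole time-step loop
       for( int t=0; t<steps; ++t )
           ParallelApplyFoo( a, n, ap );
   }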


If the data does not fit across the system's caches, there may be little
benefit. The following figure shows the possible situations.


.. container:: fignone
   :name: fig3


   Benefit of Affinity Determined by Relative Size of Data Set and Cache
   |image0|


The next figure shows how parallel speedup might vary with the size of a
data set. The computation for the example is ``A[i]+=B[i]`` for ``i`` in
the range [0,N). It was chosen for dramatic effect; you are unlikely to
see quite this much variation in your code. The graph shows little
improvement at either extreme. For small N, parallel scheduling overhead
dominates, resulting in little speedup. For large N, the data set is too
large to be carried in cache between loop invocations. The peak in the
middle is the sweet spot for affinity. Hence ``affinity_partitioner``
should be considered a tool, not a cure-all, when there is a low ratio
of computations to memory accesses.


.. container:: fignone
   :name: fig4


   Improvement from Affinity Dependent on Array Size
   |image1|


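For concreteness, a minimal sketch of the kind of loop behind such a
measurement (assumed here, not taken from the actual benchmark) is:


::


   #include "oneapi/tbb.h"

   using namespace oneapi::tbb;

   // Hypothetical sketch: repeatedly compute A[i] += B[i] over the same
   // arrays, so the affinity_partitioner can replay thread placement.
   void SweepAdd( float* A, const float* B, size_t N, int reps ) {
       static affinity_partitioner ap;
       for( int t=0; t<reps; ++t )
           parallel_for( blocked_range<size_t>(0,N),
                         [=]( const blocked_range<size_t>& r ) {
                             for( size_t i=r.begin(); i!=r.end(); ++i )
                                 A[i] += B[i];
                         }, ap );
   }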


.. |image0| image:: Images/image007.jpg
   :width: 453px
   :height: 178px
.. |image1| image:: Images/image008.jpg
   :width: 551px
   :height: 192px
