.. _Bandwidth_and_Cache_Affinity:

Bandwidth and Cache Affinity
============================


For a sufficiently simple function ``Foo``, the examples might not show
good speedup when written as parallel loops. The cause could be
insufficient system bandwidth between the processors and memory. In that
case, you may have to rethink your algorithm to take better advantage of
cache. Restructuring to better utilize the cache usually benefits the
parallel program as well as the serial program.


An alternative to restructuring that works in some cases is
``affinity_partitioner``. It not only automatically chooses the
grainsize, but also optimizes for cache affinity and tries to distribute
the data uniformly among threads. Using ``affinity_partitioner`` can
significantly improve performance when:


- The computation does a few operations per data access.

- The data acted upon by the loop fits in cache.

- The loop, or a similar loop, is re-executed over the same data.

- There are more than two hardware threads available (and especially if
  the number of threads is not a power of two). If only two threads are
  available, the default scheduling in |full_name|
  usually provides sufficient cache affinity.


The following code shows how to use ``affinity_partitioner``.


::


   #include "oneapi/tbb.h"

   using namespace oneapi::tbb;

   void ParallelApplyFoo( float a[], size_t n ) {
       // The partitioner must outlive individual calls, so it is a local static.
       // It remembers where loop iterations ran on previous calls.
       static affinity_partitioner ap;
       // ApplyFoo is the loop body class introduced in the earlier examples.
       parallel_for(blocked_range<size_t>(0,n), ApplyFoo(a), ap);
   }

   void TimeStepFoo( float a[], size_t n, int steps ) {
       for( int t=0; t<steps; ++t )
           ParallelApplyFoo( a, n );
   }


In the example, the ``affinity_partitioner`` object ``ap`` lives between
loop iterations. It remembers where iterations of the loop ran, so that
each iteration can be handed to the same thread that executed it before.
The example code gets the lifetime of the partitioner right by declaring
the ``affinity_partitioner`` as a local static object. Another approach
would be to declare it at a scope outside the iterative loop in
``TimeStepFoo`` and hand it down the call chain to ``parallel_for``; a
sketch of that variant appears at the end of this section.


If the data does not fit across the system’s caches, there may be little
benefit. The following figure illustrates the possible situations.


.. container:: fignone
   :name: fig3


   Benefit of Affinity Determined by Relative Size of Data Set and Cache
   |image0|


The next figure shows how parallel speedup might vary with the size of a
data set. The computation for the example is ``A[i]+=B[i]`` for ``i`` in
the range [0,N). It was chosen for dramatic effect; you are unlikely to
see quite this much variation in your code. The graph shows little
improvement at the extremes. For small N, parallel scheduling overhead
dominates, resulting in little speedup. For large N, the data set is too
large to be retained in cache between loop invocations. The peak in the
middle is the sweet spot for affinity. Hence ``affinity_partitioner``
should be considered a tool, not a cure-all, when there is a low ratio
of computation to memory access.


.. container:: fignone
   :name: fig4


   Improvement from Affinity Dependent on Array Size
   |image1|


.. |image0| image:: Images/image007.jpg
   :width: 453px
   :height: 178px
.. |image1| image:: Images/image008.jpg
   :width: 551px
   :height: 192px
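A minimal sketch of the alternative lifetime management mentioned above,
assuming the ``ApplyFoo`` body class from the earlier examples, might look
like the following. Here ``TimeStepFoo`` owns the partitioner and passes it
by reference down the call chain, so it persists across all time steps
without being a static object.


::


   #include "oneapi/tbb.h"

   using namespace oneapi::tbb;

   // Sketch of a variant: the caller owns the partitioner and passes it down.
   void ParallelApplyFoo( float a[], size_t n, affinity_partitioner& ap ) {
       parallel_for(blocked_range<size_t>(0,n), ApplyFoo(a), ap);
   }

   void TimeStepFoo( float a[], size_t n, int steps ) {
       affinity_partitioner ap;           // lives across all time steps
       for( int t=0; t<steps; ++t )
           ParallelApplyFoo( a, n, ap );  // the same ap is reused each step
   }


Because the partitioner is passed by non-const reference, the same object
accumulates affinity information across calls, which is what lets later time
steps replay iterations on the threads that touched the data before.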