| /oneTBB/doc/ |
| H A D | Doxyfile.in | 6 # All text after a double hash (##) is considered a comment and is placed in 10 # The format is: 21 # that follow. The default is UTF-8 which is also the encoding used for all text 31 # project for which the documentation is generated. This name is used in the 39 # control system is used. 116 # text. Otherwise, the brief description is used as-is. If left blank, the 419 # is 0..9, the default is 0, corresponding to a cache size of 2^16=65536 704 # tag is left empty. 791 # Note: If this tag is empty the current directory is searched. 1212 # 0 represents red, 60 is yellow, 120 is green, 180 is cyan, 240 is blue, 300 [all …]
|
| /oneTBB/doc/main/tbb_userguide/ |
| H A D | When_Task-Based_Programming_Is_Inappropriate.rst | 7 Using the task scheduler is usually the best approach to threading for 8 performance, however there are cases when the task scheduler is not 9 appropriate. The task scheduler is intended for high-performance 11 rarely block. However, if threads block frequently, there is a 13 is blocked, it is not working on any tasks. Blocking typically occurs 15 mutexes for long periods, your code is not likely to perform well 17 it is best to use full-blown threads for those. The task scheduler is
|
| H A D | Concurrent_Queue_Classes.rst | 9 push and pop elements from the queue. The queue is unbounded and has no 12 std::queue. The operation ``try_pop`` pops an item if it is available. 39 The equivalent thread-safe |full_name| code is: 54 definition of "first" is uncertain. Use of ``concurrent_queue`` 62 or to wait for the queue to become non-empty. Typically this is 66 Template class ``concurrent_bounded_queue<T,Alloc>`` is a variant that 88 ``concurrent_queue`` is empty, and there are ``n`` pending pop 91 ``empty()`` is defined to be true if and only if ``size()`` is not 98 causes ``push`` to block until there is room in the queue. Bounded 99 queues are slower than unbounded queues, so if there is a constraint [all …]
|
| H A D | use_input_node.rst | 7 By default, an ``input_node`` is constructed in the inactive state: 29 activated after the entire flow graph is constructed. 33 the ``input_node`` is constructed in the inactive state and activated after 51 squarer is connected. Later, when the edge to cuber is connected, cuber 55 In general it is safest to create your ``input_node`` objects in the inactive 56 state and then activate them after the whole graph is constructed. 62 the overlap of construction and execution. If your graph is a directed 65 edges in reverse topological order; that is, make the edges at the 69 src is activated just after its construction: 105 The above code is safe because the edge from ``func1`` to ``func2`` is made [all …]
|
| H A D | Mutual_Exclusion.rst | 9 exclusion is implemented by *mutexes* and *locks.* A mutex is an object 14 The simplest mutex is ``spin_mutex``. A thread trying to acquire a lock 16 ``spin_mutex`` is appropriate when the lock is held for only a few 51 inside routine ``AllocateNode`` may look unusual. Their role is to keep 59 the example is changed to 65 then the ``scoped_lock`` is destroyed when execution reaches the 93 It is recommended that you add extra braces where possible, to clarify 94 to maintainers which code is protected by the lock. 99 itself. The reason is that the C interface would not be exception safe, 113 type ``M``, the corresponding lock type is ``M::scoped_lock``. [all …]
|
| H A D | Controlling_Chunking_os.rst | 7 Chunking is controlled by a *partitioner* and a *grainsize.*\ To gain 16 argument form of the constructor is 18 ``grainsize`` is 1. It is in units of loop iterations per chunk. 25 The following code is the last example from parallel_for, modified to 48 There is also an intermediate level of control where you specify the 50 ``affinity_partitioner``. An ``auto_partitioner`` is the default 62 Because of the impact of grainsize on parallel loops, it is worth 101 ``grainsize``\ =100,000. The rationale is that each iteration 113 A drawback of setting a grainsize too high is that it can reduce 143 that with a grainsize of one, most of the overhead is parallel [all …]
|
| H A D | Nodes.rst | 7 A node is a class that inherits from oneapi::tbb::flow::graph_node and also 14 While it is possible to define your own node types by inheriting from 15 graph_node, sender and receiver, it is more typical that predefined node 19 A ``function_node`` is a predefined type available in ``flow_graph.h`` and 50 Below is code for creating a simple graph that contains a single 51 function_node. In this example, a node n is constructed that belongs to 53 invocation of the node to occur concurrently. The body is a lambda 56 for the function spin_for is not provided. 71 After the node is constructed in the example above, you can pass 73 invoking its function try_put. Using edges is described in the next [all …]
|
| H A D | Predefined_Node_Types.rst | 8 class sender and class receiver but it is likely that you can create 10 flow_graph.h. Below is a table that lists all of the predefined types 26 …When activated, it executes a user body to generate its output. Its body is invoked if downstream … 27 …Otherwise, the previous output is temporarily buffered until it is accepted downstream and then th… 29 …ntrollable concurrency level and buffering policy. For each input exactly one output is returned. 31 …utes a user body when it receives N continue_msg objects at its input. N is equal to the numbe… 39 …is a tuple of these generic types. The node combines one message from each input port to create … 41 …- A single-input, multi-output node. The input type is a tuple of generic types and there is on… 43 …sors. After broadcast, the nodes retain the last message received, so it is available to any f… 47 …l of its successors. The input type is a list of generic types and the output type is a tagged_m…
|
| H A D | appendix_A.rst | 8 threads. Each logical thread is serviced for a *time slice* by a 14 The most obvious is the time for *context switching* between logical 20 A more subtle cost is *cache cooling*. Processors keep recently accessed 21 data in cache memory, which is very fast, but also relatively small 25 reality of set-associative caches is a bit more complicated, but this is 28 into cache, taking hundreds of cycles. If it is referenced frequently 30 cache, and only take a few cycles. Such data is called "hot in cache". 40 Another cost is *lock preemption.* This happens if a thread acquires a 43 it is now going to hold it for at least as long as it takes for its next 46 effect is called *convoying*, because the threads end up "bumper to
|
| H A D | Non-Linear_Pipelines.rst | 25 parallelism is lost by forcing a linear order on the filters, but in 26 fact the only loss is in the *latency* of the pipeline, not the 27 throughput. The latency is the time it takes a token to flow from the 29 processors, the latency of the original non-linear pipeline is three 30 filters. This is because filters A and B could process the token 42 In the linear pipeline, the latency is five filters. The behavior of 49 throughput is still limited by the throughput of the slowest serial 52 The linear limitation of ``parallel_pipeline`` is a good tradeoff of
|
| H A D | catching_exceptions.rst | 8 normally, as you might expect. If an exception is thrown but is not 10 of the graph's nodes are canceled and the exception is rethrown at the 41 that is not caught within the body. This will cause the execution of the 43 g.wait_for_all(). Since it is not handled there either, the program will 62 If the exception is caught and handled in the body, then there is no 77 In this case, the execution of the graph is canceled. For our example,
|
| H A D | cancelling_nested_parallelism.rst | 7 Nested parallelism is canceled if the inner context is bound to the 8 outer context; otherwise it is not. 11 If the execution of a flow graph is canceled, either explicitly or due 20 graph, it is created with an isolated context by default.
|
| H A D | Cook_Until_Done_parallel_do.rst | 7 For some loops, the end of the iteration space is not known in advance, 12 A linked list is an example of an iteration space that is not known in 13 advance. In parallel programming, it is usually better to use dynamic 15 is inherently serial. But if you are limited to linked lists, the items 36 qualified ``operator()``. This is similar to a C++ function object from 52 The parallel form of ``SerialApplyFooToList`` is as follows: 66 ``parallel_for_each`` unscalable, because the fetching of work is 80 a node in a tree is a prerequisite to processing its descendants.
|
| H A D | Task-Based_Programming.rst | 8 poor way to do multithreaded programming. It is much better to formulate 34 when there is exactly one running logical thread per physical thread. 47 The key advantage of tasks versus logical threads is that tasks are much 49 terminating a task is about 18 times faster than starting and 50 terminating a thread. On Windows systems, the ratio is more than 100. 51 This is because a thread has its own copy of a lot of resources, such as 53 id. A task in |full_name|, in contrast, is 59 distribute time slices in a round-robin fashion. This distribution is 61 Thread schedulers are typically fair because it is the safest strategy 70 number of threads, it is important to distribute work evenly across [all …]
|
| H A D | Parallelizing_Flow_Graph.rst | 14 edges. When a node in the graph receives a message, a task is spawned to 22 where a sequence of values is processed as each value passes through the 23 nodes in the graph. In this example, the sequence is created by a 27 completely processed, sum is equal to the sum of the sequence of squares 47 this example, a dependence graph is used to establish a partial ordering 54 together. This is a partial ordering because, for example, it doesn't 74 jar, are shared between ordered steps, it is not explicit in the graph. 75 Instead, only the required ordering of steps is explicit in a dependence 84 the runtime library spawns tasks to exploit the parallelism that is 90 is legal to execute in parallel, but allows the runtime library to [all …]
|
| H A D | Dependence_Graph.rst | 10 execute the body objects when it is legal to do so according to the 33 task to execute their body when this count is equal to the total number 42 complete before D and F start executing. This is a partial ordering 43 because, for example, there is no explicit ordering requirement between 71 The first argument is the graph it belongs to and the second is a 73 continue_node is always assumed to have unlimited concurrency and will 77 The following code snippet is an implementation of the example in this 108 One possible execution of this graph is shown below. The execution of D 126 Again, it is important to note that all execution in the flow graph 133 oneTBB work pool while it is waiting. [all …]
|
| /oneTBB/doc/main/tbb_userguide/design_patterns/ |
| H A D | Reference_Counting.rst | 22 Often it is desirable to destroy an object when it is known that it 23 will not be used in the future. Reference counting is a common serial 33 - If there are cycles of references, basic reference counting is 34 insufficient unless the cycle is explicitly broken. 37 - Atomic counting is relatively expensive in hardware. 46 Thread-safe reference counting is like serial reference counting, 47 except that the increment/decrement is done atomically, and the 48 decrement and test "count is zero?" must act as a single atomic 75 It is incorrect to use a separate read for testing if the count is 91 writes complete before the object is deleted. [all …]
|
| H A D | Agglomeration.rst | 13 Parallelism is so fine grained that overhead of parallel scheduling 27 if each scalar addition is scheduled as a separate task, most of the 42 - The parallelism is for sake of performance and not required for 62 The choice of block topology is typically driven by two concerns: 76 If the loop is "small", on the order of less than 10,000 clock 101 analysis, be careful to consider that information is transferred in 119 so the solution is to treat subtrees as a groups as shown in the 130 Often such an agglomeration is achieved by recursing serially once 144 There agglomeration is part of a four step **PCAM** design method: 151 #. **C**\ ommunication – figure out what communication is required [all …]
|
| H A D | Divide_and_Conquer.rst | 22 Divide and conquer is widely used in serial algorithms. Common 36 - Splitting problem or merging solutions is relatively cheap 64 Quicksort is a classic divide-and-conquer algorithm. It divides a 83 provides a simple way to parallelize it. The parallel code is shown 101 Eventually the subsorts become small enough that serial execution is 121 The change is an instance of the Agglomeration pattern. 158 A parallel version is shown below. 191 The recursive walk is parallelized using class ``task_group`` to do 196 is introduced. Because it would be unsafe to update ``Hits`` 198 accumulate results. Because it is of type [all …]
|
| H A D | Elementwise.rst | 25 summary information is collected, use the Reduction pattern instead. 34 No information is carried or merged between the computations. 43 If the number of items is known in advance, use 52 If the pattern is followed by a reduction on the same data, consider 54 the combination of the two patterns is accomplished in a single sweep 65 Convolution is often used in signal processing. The convolution of a 66 filter ``c`` and signal ``x`` is computed as: 85 For simplicity, the fragment assumes that ``x`` is a pointer into an 92 fits the elementwise pattern. It is straightforward to render it 109 there is reason to agglomerate explicitly, use the overload of
|
| /oneTBB/ |
| H A D | Bazel.md | 3 The main build system of oneTBB is CMake*. 4 [Bazel*](https://bazel.build/) support is community-based. 10 Bazel is not recommended for use by oneTBB maintainers. Thus, it is not used internally. 15 The Bazel oneTBB build is currently only intended for a subset of oneTBB that suffices restricted u… 18 The standard Bazel approach to handling third-party libraries is static linking. It is the best pra… 24 The following file structure is assumed: 45 In the *WORKSPACE* file, the oneTBB GitHub* repository is fetched. 77 The expected output of this program is the current version of oneTBB. 88 This flag is supported by the GNU* Compiler Collection (GCC) version 9.3, Clang* 12, and newer vers…
|
| /oneTBB/examples/parallel_for/polygon_overlay/ |
| H A D | README.md | 4 This example is a simple implementation of polygon overlay, as described in Parallelizing the [Poly… 9 …is split into submaps, with each resulting submap being intersected against the corresponding subm… 11 …is that the area of the generated sub-polygons are subtracted from the original area of one of the… 15 * the fact that for each submap, the number of polygons is smaller than that for the other two case… 17 …is approximately 80,000 (400 * 200, where 200 is the average number of polygons examined before st… 25 …is, if we need `N` polygons, then `N` "boxes" are chosen at random, then one-at-a-time the areas a… 27 One limitation of the program is that if the number of polygons in the source map is greater than t… 35 … on Windows, `x`,`con` on Linux and `mac`,`con` on macOS. The default mode is `con`. See the [comm…
|
| /oneTBB/examples/graph/cholesky/ |
| H A D | README.md | 4 …n or threaded implementation depending on the version of the oneMKL library that is linked against. 6 …hat uses the Crout-Cholesky algorithm for factorization. The same approach is parallelized for the… 8 …is similar to the serial implementation is used to create an unrolled version of the computation. … 10 …is a small, compact graph that passes tiles along its edges. There is one node per type of oneMKL … 33 …efix>_posdef.txt and <output_prefix>_X.txt; where `X` is the algorithm used. If `output_prefix` is…
|
| /oneTBB/examples/parallel_for_each/parallel_preorder/ |
| H A D | README.md | 4 …is called a "cell". Each cell has a value. The value is a matrix. Some of the cells have operators… 8 1. Compute the set of cells that have no inputs. This set is called `root_set`. 9 2. Each cell has an associated field `ref_count` that is an atomic integer. Initialize `ref_count` … 17 The example is using custom synchronization via ref_count atomic variable. Correctness checking too… 19 …te:** It is important to understand that this example is unlikely to show speedup if the cell valu… 42 * `n-of-nodes` - the number of nodes in the graph. Default value is 1000. 43 * `n-of-traversals` - the number of times to evaluate the graph. Default value is 500.
|
| /oneTBB/doc/main/reference/ |
| H A D | reference.rst | 7 version of oneAPI Specification is 1.0. 26 A preview feature is a component of oneTBB introduced to receive early feedback from 31 - It is off by default and must be explicitly enabled. 32 - It is intended to have a high quality implementation. 33 - There is no guarantee of future existence or compatibility. 38 A preview feature is subject to change in future. It might be removed or significantly 41 is strongly discouraged.
|