# Multithreaded Stackshot

Stackshot has been retrofitted to take advantage of multiple CPUs. This document
details the design of multithreaded stackshot.

## Terminology

- **Initiating / Calling CPU**: The CPU which stackshot was called from.
- **Main CPU**: The CPU which populates workqueues and collects global state.
- **Auxiliary CPU**: A CPU which is not the main CPU.
- **KCData**: The containerized data structure that stackshot outputs. See
  `osfmk/kern/kcdata.h` for more information.

## Overview

When a stackshot is taken, the initiating CPU (the CPU from which stackshot was
called) sets up state. It then enters the debugger trap and IPIs the other
cores into the debugger trap as well. The other CPUs call into stackshot from
the debugger trap instead of spinning, and determine whether they are eligible
to work based on perfcontrol's recommendation. (This check is needed because a
CPU that has been derecommended, due to thermal limits or otherwise, will still
be IPI'd into the debugger trap, and we want to avoid overheating it.)

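A rough sketch of the per-CPU entry path is shown below. The helper names
(`stackshot_cpu_is_recommended`, `stackshot_cpu_do_work`,
`stackshot_cpu_wait_for_done`) are illustrative assumptions, not the actual
routines.

```c
#include <stdbool.h>

extern bool stackshot_cpu_is_recommended(int cpu_id); /* perfcontrol recommendation (assumed) */
extern void stackshot_cpu_do_work(int cpu_id);        /* drain the workqueues (assumed) */
extern void stackshot_cpu_wait_for_done(int cpu_id);  /* spin until the stackshot finishes (assumed) */

/* Called from the debugger trap on every CPU that was IPI'd in. */
static void
stackshot_cpu_enter(int cpu_id)
{
    if (stackshot_cpu_is_recommended(cpu_id)) {
        /* Recommended CPUs help drain the workqueues. */
        stackshot_cpu_do_work(cpu_id);
    }
    /* Derecommended CPUs (e.g. thermally limited) skip the work and simply
     * wait here, so they don't heat up further. */
    stackshot_cpu_wait_for_done(cpu_id);
}
```
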
On AMP systems, a suitable P-core is chosen to be the “main” CPU and begins
populating queues of tasks to be put into the stackshot and collecting bits of
global state. (On SMP systems, the initiating CPU is always assigned to be the
main CPU.)

The other CPUs begin chipping away at the queues, and the main CPU joins
in once it is done populating them. Once all CPUs are finished, they exit the
debugger trap, interrupts are re-enabled, and the kcdata from all of the CPUs
is collated together by the initiating CPU. The output is identical to
single-threaded stackshot.

It is important to note that since stackshot happens outside of the context of
the scheduler and with interrupts disabled, it does not use "actual" threads to
do its work - each CPU has its own execution context and no context switching
occurs. Nothing else runs on the system while a stackshot is happening; this
allows stackshot to grab an atomic snapshot of the entire system's state.

## Work Queues

In order to split up work between CPUs, each task is put into a workqueue for
CPUs to pull from. On SMP systems, there is only one queue. On AMP systems,
there are two, and tasks are sorted between the queues based on their
"difficulty" (i.e. the number of threads they have). E-cores will work on the
easier queue first, and P-cores will work on the harder queue first. Once a CPU
finishes with its first queue, it will move on to the other.

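As a sketch (under assumed helper names `stackshot_queue_pop_task` and
`stackshot_collect_task`), the per-CPU drain order on an AMP system might look
like:

```c
#include <stdbool.h>

typedef enum { QUEUE_EASY = 0, QUEUE_HARD = 1 } stackshot_queue_id_t;

extern bool stackshot_queue_pop_task(stackshot_queue_id_t q, void **task_out); /* assumed */
extern void stackshot_collect_task(void *task);                                /* assumed */

static void
stackshot_cpu_drain_queues(bool is_p_core)
{
    /* P-cores start on the hard (many-thread) queue, E-cores on the easy
     * one; each moves to the other queue once its first queue is empty. */
    stackshot_queue_id_t first  = is_p_core ? QUEUE_HARD : QUEUE_EASY;
    stackshot_queue_id_t second = is_p_core ? QUEUE_EASY : QUEUE_HARD;
    void *task;

    while (stackshot_queue_pop_task(first, &task)) {
        stackshot_collect_task(task);
    }
    while (stackshot_queue_pop_task(second, &task)) {
        stackshot_collect_task(task);
    }
}
```
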
If latency collection is enabled, each CPU will record information about its run
in a `stackshot_latency_cpu` structure in the KCData. This includes information
such as the amount of time spent waiting for the queue and the number of tasks /
threads processed by the CPU during its run.

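For illustration only, the kind of per-CPU counters described above could be
modeled like this; the field names are assumptions, not the actual layout of
`stackshot_latency_cpu`:

```c
#include <stdint.h>

/* Illustrative stand-in for the real structure; fields are assumptions
 * based on the description above. */
struct stackshot_latency_cpu_example {
    uint64_t queue_wait_time;   /* time spent waiting on the workqueues */
    uint64_t tasks_processed;   /* tasks collected by this CPU */
    uint64_t threads_processed; /* threads collected by this CPU */
};
```
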
## Buffers and Memory

Stackshot is given a fixed-size buffer upfront since it cannot allocate any
memory for itself. The size estimation logic in multithreaded stackshot is
improved from that of single-threaded stackshot - it uses various heuristics
such as the number of tasks and threads on the system, the flags passed, sizes
of data structures, and a fudge factor to give a reasonable estimate for a
buffer size. Should the buffer be too small, stackshot will try again with a
bigger one. The number of tries is recorded in the
`stackshot_latency_collection_v2` struct if latency collection is enabled.

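The overall pattern is an estimate-and-retry loop along these lines; the helpers
(`stackshot_estimate_size`, `stackshot_try`, and the buffer alloc/free routines)
are hypothetical names used for the sketch:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

extern size_t stackshot_estimate_size(uint64_t flags);               /* heuristic estimate (assumed) */
extern bool   stackshot_try(void *buf, size_t size, uint64_t flags); /* false if the buffer was too small (assumed) */
extern void  *stackshot_alloc_buffer(size_t size);                   /* assumed */
extern void   stackshot_free_buffer(void *buf, size_t size);         /* assumed */

static void *
stackshot_with_retries(uint64_t flags, int max_tries, size_t *size_out)
{
    size_t size = stackshot_estimate_size(flags);

    for (int attempt = 0; attempt < max_tries; attempt++) {
        void *buf = stackshot_alloc_buffer(size);
        if (buf == NULL) {
            return NULL;
        }
        if (stackshot_try(buf, size, flags)) {
            *size_out = size;
            return buf; /* caller owns the filled buffer */
        }
        /* Buffer was too small: free it, grow the estimate (doubling here
         * for illustration), and retry. */
        stackshot_free_buffer(buf, size);
        size *= 2;
    }
    return NULL;
}
```
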
### Bump Allocator

Stackshot uses a basic per-cluster bump allocator to allocate space within the
buffer. Each cluster gets its own bump allocator to mitigate cache contention,
with space split evenly between the clusters. If a cluster runs out of buffer
space, it can reach into other clusters for more.

Memory that is freed is put into a per-cluster freelist. Even if the data was
originally allocated from a different cluster's buffer, it will be put into the
current cluster's freelist (again, to reduce cache effects). The freelist is a
last resort, and is only used if the current cluster's buffer space fills.

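A minimal sketch of the allocation and free paths follows, with illustrative
structure and names; the real allocator's layout and cross-cluster spill logic
are more involved:

```c
#include <stddef.h>
#include <stdint.h>

struct free_chunk {
    struct free_chunk *next;
    size_t             size;
};

struct cluster_allocator {
    uint8_t           *base;     /* this cluster's slice of the stackshot buffer */
    size_t             capacity;
    size_t             used;     /* bump-pointer offset */
    struct free_chunk *freelist; /* freed chunks; only used as a last resort */
};

static void *
cluster_alloc(struct cluster_allocator *c, size_t size)
{
    /* Fast path: bump-allocate from this cluster's own space. */
    if (c->used + size <= c->capacity) {
        void *p = c->base + c->used;
        c->used += size;
        return p;
    }
    /* In the real allocator, the next step is to reach into another
     * cluster's buffer for more space (not shown here). */

    /* Last resort: reuse a freed chunk (first fit). */
    for (struct free_chunk **fp = &c->freelist; *fp != NULL; fp = &(*fp)->next) {
        if ((*fp)->size >= size) {
            void *p = *fp;
            *fp = (*fp)->next;
            return p;
        }
    }
    return NULL;
}

static void
cluster_free(struct cluster_allocator *c, void *p, size_t size)
{
    /* Freed memory always goes onto the *current* cluster's freelist, even
     * if it was allocated from another cluster's buffer. Assumes size is at
     * least sizeof(struct free_chunk). */
    struct free_chunk *f = p;
    f->size = size;
    f->next = c->freelist;
    c->freelist = f;
}
```
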
Each CPU will report information about its buffers in its
`stackshot_latency_cpu` struct. This includes the total amount of buffer space
used and the amount of buffer space allocated from other clusters.

### Linked-List kcdata

Each CPU needs its own kcdata descriptor, but we don't know exactly how big each
one should be ahead of time. Because of this, we allocate kcdata buffers in
reasonably-sized chunks as we need them. We also want the output to have each
task in order (to keep the output identical to single-threaded stackshot), so we
maintain a linked list of these kcdata chunks for each task in the queue.

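Structurally, each queue entry carries the head of its task's chunk list, along
the lines of this sketch (the types and names are illustrative):

```c
#include <stddef.h>
#include <stdint.h>

struct kcdata_chunk {
    struct kcdata_chunk *next; /* next chunk holding this task's kcdata */
    size_t               used; /* bytes written into this chunk */
    size_t               size; /* chunk capacity */
    uint8_t              data[];
};

struct workqueue_entry {
    void                *task;   /* task to be collected */
    struct kcdata_chunk *chunks; /* head of this task's kcdata chunk list */
};

/* Append a freshly allocated chunk once the current one fills up. */
static void
workqueue_entry_append_chunk(struct workqueue_entry *e, struct kcdata_chunk *chunk)
{
    struct kcdata_chunk **tail = &e->chunks;

    while (*tail != NULL) {
        tail = &(*tail)->next;
    }
    chunk->next = NULL;
    *tail = chunk;
}
```
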
The chunks are sized such that only one is needed for the average task. If we
have any extra room at the end of the current chunk once we finish with a task,
we can add it to the freelist - but this is not ideal. So, stackshot uses
various heuristics, including flags and current task / thread counts, to
estimate a good chunk size. The amount of memory added to the freelist is
reported as a named uint64 in the KCData (`stackshot_buf_overhead`).

```
 Workqueue

⎡ Task #1 ⎤
⎢  CPU 0  ⎥
⎣ kcdata* ⎦-->[ KCData A ]--[ KCData B ]
⎡ Task #2 ⎤
⎢  CPU 1  ⎥
⎣ kcdata* ⎦-->[ KCData C ]
⎡ Task #3 ⎤
⎢  CPU 2  ⎥
⎣ kcdata* ⎦-->[ KCData D ]--[ KCData E ]--[ KCData F ]
    ...
```

Once the stackshot is finished and interrupts are re-enabled, this data is woven
back together into a single KCData buffer by the initiating thread, such that it
is indistinguishable from the output of a single-threaded stackshot (essentially,
we memcpy the contents of each kcdata chunk into a single buffer, stripping off
the headers and footers).

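The collation pass amounts to walking each task's chunk list in task order and
copying only the payload bytes, roughly as sketched below (the header/footer
size helpers are assumptions):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct kcdata_chunk {
    struct kcdata_chunk *next;
    size_t               used; /* bytes written, including header and footer */
    uint8_t              data[];
};

extern size_t kcdata_chunk_header_size(void); /* assumed */
extern size_t kcdata_chunk_footer_size(void); /* assumed */

/* Copy one task's chunks into the output buffer, skipping per-chunk headers
 * and footers so the result looks like single-threaded stackshot output.
 * Returns the number of bytes written. */
static size_t
collate_task_chunks(const struct kcdata_chunk *head, uint8_t *out, size_t out_size)
{
    size_t off = 0;

    for (const struct kcdata_chunk *c = head; c != NULL; c = c->next) {
        size_t hdr = kcdata_chunk_header_size();
        size_t ftr = kcdata_chunk_footer_size();
        size_t payload = c->used - hdr - ftr;

        if (off + payload > out_size) {
            break; /* output buffer exhausted */
        }
        memcpy(out + off, c->data + hdr, payload);
        off += payload;
    }
    return off;
}
```
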
## “Tracing”

In debug and development builds, stackshot takes a "trace" of itself during
execution. There are circular per-CPU buffers containing a list of tracepoints,
each consisting of a timestamp, a line number, and an arbitrary uintptr_t-sized
piece of extra data. This allows for basic tracing of stackshot's execution on
each CPU, which can be seen from a debugger.

By default, tracepoints are only emitted when stackshot runs into an error (with
the error number as the data), but it's trivial to add more with the
`STACKSHOT_TRACE(data)` macro.

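A minimal sketch of what such a per-CPU trace buffer and macro could look like
follows; the buffer layout, size, and accessor names here are assumptions, not
the real implementation:

```c
#include <stdint.h>

#define STACKSHOT_TRACE_ENTRIES 64 /* illustrative size */

struct stackshot_tracepoint {
    uint64_t  timestamp;
    uint32_t  line; /* __LINE__ at the trace site */
    uintptr_t data; /* arbitrary extra data, e.g. an error code */
};

struct stackshot_trace_buffer {
    uint32_t                    next; /* monotonically increasing; index wraps */
    struct stackshot_tracepoint entries[STACKSHOT_TRACE_ENTRIES];
};

extern uint64_t current_timestamp(void);                       /* assumed */
extern struct stackshot_trace_buffer *current_cpu_trace(void); /* assumed per-CPU accessor */

static void
stackshot_trace_record(uint32_t line, uintptr_t data)
{
    struct stackshot_trace_buffer *tb = current_cpu_trace();
    struct stackshot_tracepoint *tp = &tb->entries[tb->next % STACKSHOT_TRACE_ENTRIES];

    tp->timestamp = current_timestamp();
    tp->line = line;
    tp->data = data;
    tb->next++;
}

#define STACKSHOT_TRACE(d) stackshot_trace_record(__LINE__, (uintptr_t)(d))
```
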
An lldb macro is in the works which will allow this data to be examined more
easily, but for now, it can be examined in lldb with `showpcpu -V
stackshot_trace_buffer`.

## Panics

During a panic stackshot, stackshot behaves essentially identically to how it
did before (with a single CPU/thread) - the only difference is that we can
now take a stackshot if the system panicked during a stackshot, since state has
been compartmentalized. If the system panics during a panic stackshot, another
stackshot will not be taken.

Since stackshot takes place entirely from within the debugger trap, if an
auxiliary CPU (i.e. a CPU other than the one which initiated the stackshot)
panics, it will not be able to acquire the debugger lock, since that lock is
already held by the initiating CPU. To mitigate this, when a CPU panics during a
stackshot, it sets a flag in stackshot's state to indicate there was a panic by
calling into `stackshot_cpu_signal_panic`.

There are checks for this flag at various points in stackshot, and once a CPU
notices it is set, it will spin in place. Before the initiating CPU spins in
place, it will release the debugger lock. Once all CPUs are spinning, the panic
will continue.

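The hand-off can be sketched as follows. The flag, lock, and role-check helpers
are hypothetical names; only `stackshot_cpu_signal_panic` is mentioned above,
and this is not its actual implementation:

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool stackshot_panic_pending; /* set by the panicking CPU */

extern void debugger_unlock(void);            /* assumed: releases the debugger lock */
extern bool stackshot_cpu_is_initiator(void); /* assumed */

/* Called by a CPU that panics while a stackshot is in progress. */
void
stackshot_cpu_signal_panic(void)
{
    atomic_store(&stackshot_panic_pending, true);
}

/* Checked at various points in the stackshot work loops. */
static void
stackshot_check_for_panic(void)
{
    if (!atomic_load(&stackshot_panic_pending)) {
        return;
    }
    if (stackshot_cpu_is_initiator()) {
        /* The initiator drops the debugger lock so the panicking CPU can
         * take it and let the panic continue. */
        debugger_unlock();
    }
    for (;;) {
        /* Spin in place; the panic flow takes over once all CPUs are here. */
    }
}
```
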
## Future Work

- It might be more elegant to give stackshot its own IPI flavor instead of
  piggybacking on the debugger trap.
- The tracing buffer isn't easily inspected - an LLDB macro to walk the circular
  buffer and print a trace would be helpful.
- Chunk size is currently static for the entire stackshot - instead of
  estimating it once, we could estimate it for every task to further eliminate
  overhead.
163