# Multithreaded Stackshot

Stackshot has been retrofitted to take advantage of multiple CPUs. This document
details the design of multithreaded stackshot.

## Terminology

- **Initiating / Calling CPU**: The CPU which stackshot was called from.
- **Main CPU**: The CPU which populates workqueues and collects global state.
- **Auxiliary CPU**: A CPU which is not the main CPU.
- **KCData**: The containerized data structure that stackshot outputs. See
  `osfmk/kern/kcdata.h` for more information.

## Overview

When a stackshot is taken, the initiating CPU (the CPU from which stackshot was
called) sets up state. It then enters the debugger trap and IPIs the other
cores into the debugger trap as well. The other CPUs call into stackshot from
the debugger trap instead of spinning, and determine whether they are eligible
to work based on perfcontrol's recommendation. (This check is needed because
even a CPU that is derecommended due to thermal limits or otherwise will still
be IPI'd into the debugger trap, and we want to avoid overheating the CPU.)

On AMP systems, a suitable P-core is chosen to be the “main” CPU; it begins
populating queues of tasks to be put into the stackshot and collecting bits of
global state. (On SMP systems, the initiating CPU is always the main CPU.)

The other CPUs begin chipping away at the queues, and the main CPU joins in
once it is done populating them. Once all CPUs are finished, they exit the
debugger trap, interrupts are re-enabled, and the kcdata from all of the CPUs
is collated together by the initiating CPU. The output is identical to that of
single-threaded stackshot.

It is important to note that since stackshot happens outside of the context of
the scheduler and with interrupts disabled, it does not use "actual" threads to
do its work - each CPU has its own execution context and no context switching
occurs. Nothing else runs on the system while a stackshot is happening; this
allows stackshot to grab an atomic snapshot of the entire system's state.

## Work Queues

In order to split up work between CPUs, each task is put into a workqueue for
CPUs to pull from. On SMP systems, there is only one queue. On AMP systems,
there are two, and tasks are sorted between the queues based on their
"difficulty" (i.e. the number of threads they have). E-cores work on the easier
queue first, and P-cores work on the harder queue first. Once a CPU finishes
with its first queue, it moves on to the other.

If latency collection is enabled, each CPU will record information about its run
in a `stackshot_latency_cpu` structure in the KCData. This includes information
such as the amount of time spent waiting for the queue and the number of tasks
and threads processed by the CPU during its run.

## Buffers and Memory

Stackshot is given a fixed-size buffer upfront since it cannot allocate any
memory for itself. The size estimation logic in multithreaded stackshot is
improved over that of single-threaded stackshot - it uses various heuristics,
such as the number of tasks and threads on the system, the flags passed, the
sizes of data structures, and a fudge factor, to give a reasonable estimate for
a buffer size. Should the buffer be too small, stackshot will try again with a
bigger one. The number of tries is recorded in the
`stackshot_latency_collection_v2` struct if latency collection is enabled.

### Bump Allocator

Stackshot uses a basic per-cluster bump allocator to allocate space within the
buffer. Each cluster gets its own bump allocator to mitigate cache contention,
with space split evenly between the clusters. If a cluster runs out of buffer
space, it can reach into other clusters for more.

Memory that is freed is put into a per-cluster freelist. Even if the data was
originally allocated from a different cluster's buffer, it will be put into the
current cluster's freelist (again, to reduce cache effects). The freelist is a
last resort, and is only used once the current cluster's buffer space fills.

Each CPU reports information about its buffers in its `stackshot_latency_cpu`
struct. This includes the total amount of buffer space used and the amount of
buffer space allocated from other clusters.
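To make the allocation scheme concrete, here is a minimal, self-contained sketch
of a per-cluster bump allocator with a cross-cluster fallback and per-cluster
freelists. All names (`ss_cluster_arena`, `ss_alloc`, and so on) are
hypothetical and do not correspond to stackshot's actual symbols, and the
synchronization a real implementation needs between CPUs sharing a cluster is
omitted.

```c
#include <stddef.h>
#include <stdint.h>

/* All names here are illustrative, not stackshot's real symbols. */
#define SS_MAX_CLUSTERS 4

struct ss_free_block {
	struct ss_free_block *next;
	size_t size;
};

struct ss_cluster_arena {
	uint8_t *base;                  /* this cluster's slice of the big buffer */
	size_t   size;                  /* bytes in the slice */
	size_t   used;                  /* bump offset */
	size_t   borrowed;              /* bytes taken from other clusters (stats) */
	struct ss_free_block *freelist; /* blocks freed by this cluster's CPUs */
};

static struct ss_cluster_arena ss_arenas[SS_MAX_CLUSTERS];
static unsigned ss_ncluster;

/* Split the single pre-sized stackshot buffer evenly among the clusters. */
static void
ss_arenas_init(void *buf, size_t buf_size, unsigned nclusters)
{
	size_t slice = buf_size / nclusters;

	ss_ncluster = nclusters;
	for (unsigned i = 0; i < nclusters; i++) {
		ss_arenas[i] = (struct ss_cluster_arena){
			.base = (uint8_t *)buf + i * slice,
			.size = slice,
		};
	}
}

/* Bump-allocate from one arena; NULL if the request does not fit. */
static void *
ss_bump(struct ss_cluster_arena *a, size_t size)
{
	if (a->size - a->used < size) {
		return NULL;
	}
	void *p = a->base + a->used;
	a->used += size;
	return p;
}

/*
 * Allocation order described above: the current cluster's bump space first,
 * then other clusters' bump space, then the current cluster's freelist as a
 * last resort.
 */
static void *
ss_alloc(unsigned cur, size_t size)
{
	void *p = ss_bump(&ss_arenas[cur], size);

	for (unsigned i = 0; p == NULL && i < ss_ncluster; i++) {
		if (i != cur && (p = ss_bump(&ss_arenas[i], size)) != NULL) {
			ss_arenas[cur].borrowed += size;
		}
	}
	for (struct ss_free_block **bp = &ss_arenas[cur].freelist;
	    p == NULL && *bp != NULL; bp = &(*bp)->next) {
		if ((*bp)->size >= size) {
			p = *bp;
			*bp = (*bp)->next; /* unlink; break before the loop re-reads *bp */
			break;
		}
	}
	return p; /* NULL => buffer exhausted; retry with a bigger buffer */
}

/* Freed memory always lands on the *current* cluster's freelist. */
static void
ss_free(unsigned cur, void *p, size_t size)
{
	struct ss_free_block *b = p; /* assumes size >= sizeof(*b) */

	b->size = size;
	b->next = ss_arenas[cur].freelist;
	ss_arenas[cur].freelist = b;
}
```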
### Linked-List kcdata

Each CPU needs its own kcdata descriptor, but we don't know exactly how big each
one should be ahead of time. Because of this, we allocate kcdata buffers in
reasonably-sized chunks as we need them. We also want the output to have each
task in order (to keep the output identical to single-threaded stackshot), so we
maintain a linked list of these kcdata chunks for each task in the queue.

The chunks are sized such that only one is needed for the average task. If we
have any extra room at the end of the current chunk once we finish with a task,
we can add it to the freelist - but this is not ideal. So, stackshot uses
various heuristics, including the flags and the current task / thread counts, to
estimate a good chunk size. The amount of memory added to the freelist is
reported by a named uint64 in the KCData (`stackshot_buf_overhead`).

```
      Workqueue

⎡ Task #1  ⎤
⎢ CPU 0    ⎥
⎣ kcdata*  ⎦-->[ KCData A ]--[ KCData B ]
⎡ Task #2  ⎤
⎢ CPU 1    ⎥
⎣ kcdata*  ⎦-->[ KCData C ]
⎡ Task #3  ⎤
⎢ CPU 2    ⎥
⎣ kcdata*  ⎦-->[ KCData D ]--[ KCData E ]--[ KCData F ]
     ...
```

Once the stackshot is finished and interrupts are re-enabled, this data is woven
back together into a single KCData buffer by the initiating CPU, such that it is
indistinguishable from the output of a single-threaded stackshot (essentially,
we memcpy the contents of each kcdata chunk into a single buffer, stripping off
the headers and footers).
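The sketch below illustrates the shape of this data structure and the final
collation pass. The types and the `ss_collate` helper are simplified and
hypothetical; the real kcdata descriptor in `osfmk/kern/kcdata.h` and
stackshot's workqueue entries look different.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical, simplified types for illustration; the real kcdata
 * descriptor (osfmk/kern/kcdata.h) and workqueue entries differ.
 */
struct ss_kcdata_chunk {
	struct ss_kcdata_chunk *next;     /* next chunk for the same task */
	size_t                  hdr_size; /* per-chunk header to strip */
	size_t                  ftr_size; /* per-chunk footer to strip */
	size_t                  used;     /* bytes written, including hdr/ftr */
	uint8_t                 data[];   /* chunk contents */
};

struct ss_workqueue_entry {
	/* task pointer, difficulty, owning CPU, etc. omitted */
	struct ss_kcdata_chunk *chunks;   /* head of this task's kcdata list */
};

/*
 * After interrupts are re-enabled, the initiating CPU walks the workqueue
 * in task order and copies each chunk's payload into one contiguous KCData
 * buffer, so the result matches what single-threaded stackshot would emit.
 */
static size_t
ss_collate(struct ss_workqueue_entry *entries, size_t nentries,
    uint8_t *out, size_t out_size)
{
	size_t off = 0;

	for (size_t i = 0; i < nentries; i++) {
		for (struct ss_kcdata_chunk *c = entries[i].chunks; c != NULL;
		    c = c->next) {
			size_t payload = c->used - c->hdr_size - c->ftr_size;

			if (out_size - off < payload) {
				return 0; /* output buffer too small */
			}
			memcpy(out + off, c->data + c->hdr_size, payload);
			off += payload;
		}
	}
	return off; /* total bytes of collated kcdata */
}
```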
## “Tracing”

In debug and development builds, stackshot takes a "trace" of itself during
execution. There are circular per-CPU buffers containing a list of tracepoints,
each of which consists of a timestamp, a line number, and an arbitrary
uintptr_t-sized piece of extra data. This allows for basic tracing of
stackshot's execution on each CPU, which can be seen from a debugger.

By default, tracepoints are only emitted when stackshot runs into an error (with
the error number as the data), but it's trivial to add more with the
`STACKSHOT_TRACE(data)` macro.

An lldb macro is in the works which will allow this data to be examined more
easily, but for now, it can be examined in lldb with `showpcpu -V
stackshot_trace_buffer`.

## Panics

During a panic stackshot, stackshot behaves essentially identically to how it
did before (with a single CPU/thread), with the only difference being that we
can now take a stackshot if the system panicked during a stackshot, since state
has been compartmentalized. If the system panics during a panic stackshot,
another stackshot will not be taken.

Since stackshot takes place entirely from within the debugger trap, if an
auxiliary CPU (i.e. a CPU other than the one which initiated the stackshot)
panics, it will not be able to acquire the debugger lock, since that lock is
already held by the initiating CPU. To mitigate this, when a CPU panics during a
stackshot, it sets a flag in stackshot's state to indicate that there was a
panic by calling into `stackshot_cpu_signal_panic`.

This flag is checked at various points in stackshot, and once a CPU notices it
is set, it will spin in place. Before the initiating CPU spins in place, it
releases the debugger lock. Once all CPUs are spinning, the panic will continue.
(A sketch of this handshake appears at the end of this document.)

## Future Work

- It might be more elegant to give stackshot its own IPI flavor instead of
  piggybacking on the debugger trap.
- The tracing buffer isn't easily inspected - an LLDB macro to walk the circular
  buffer and print a trace would be helpful.
- Chunk size is currently static for the entire stackshot - instead of
  estimating it once, we could estimate it for every task to further eliminate
  overhead.
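To make the panic handshake described in the Panics section concrete, here is a
minimal sketch. The names (`ss_panic_pending`, `ss_signal_panic`,
`ss_panic_check`) are hypothetical, and plain C11 atomics stand in for the real
per-stackshot state, the real `stackshot_cpu_signal_panic`, and the real
debugger-lock release.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the flag kept in stackshot's state. */
static atomic_bool ss_panic_pending;

/*
 * Called by a CPU that panics while a stackshot is in progress
 * (the role played by stackshot_cpu_signal_panic).
 */
static void
ss_signal_panic(void)
{
	atomic_store(&ss_panic_pending, true);
}

/*
 * Checked at various points in the stackshot code paths: if some CPU has
 * panicked, park this CPU so the panic flow can take over. The initiating
 * CPU must drop the debugger lock before parking so the panicking CPU can
 * acquire it.
 */
static void
ss_panic_check(bool is_initiating_cpu)
{
	if (!atomic_load(&ss_panic_pending)) {
		return;
	}
	if (is_initiating_cpu) {
		/* release the debugger lock here (omitted in this sketch) */
	}
	for (;;) {
		/* spin in place; the panic proceeds once all CPUs are parked */
	}
}
```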