12fcd5affSSteven Rostedt==============================
22fcd5affSSteven RostedtUsing the tracer for debugging
32fcd5affSSteven Rostedt==============================
42fcd5affSSteven Rostedt
52fcd5affSSteven RostedtCopyright 2024 Google LLC.
62fcd5affSSteven Rostedt
72fcd5affSSteven Rostedt:Author:   Steven Rostedt <[email protected]>
82fcd5affSSteven Rostedt:License:  The GNU Free Documentation License, Version 1.2
92fcd5affSSteven Rostedt          (dual licensed under the GPL v2)
102fcd5affSSteven Rostedt
112fcd5affSSteven Rostedt- Written for: 6.12
122fcd5affSSteven Rostedt
132fcd5affSSteven RostedtIntroduction
142fcd5affSSteven Rostedt------------
152fcd5affSSteven RostedtThe tracing infrastructure can be very useful for debugging the Linux
162fcd5affSSteven Rostedtkernel. This document is a place to add various methods of using the tracer
172fcd5affSSteven Rostedtfor debugging.
182fcd5affSSteven Rostedt
192fcd5affSSteven RostedtFirst, make sure that the tracefs file system is mounted::
202fcd5affSSteven Rostedt
212fcd5affSSteven Rostedt $ sudo mount -t tracefs tracefs /sys/kernel/tracing
222fcd5affSSteven Rostedt
232fcd5affSSteven Rostedt
242fcd5affSSteven RostedtUsing trace_printk()
252fcd5affSSteven Rostedt--------------------
262fcd5affSSteven Rostedt
272fcd5affSSteven Rostedttrace_printk() is a very lightweight utility that can be used in any context
282fcd5affSSteven Rostedtinside the kernel, with the exception of "noinstr" sections. It can be used
292fcd5affSSteven Rostedtin normal, softirq, interrupt and even NMI context. The trace data is
302fcd5affSSteven Rostedtwritten to the tracing ring buffer in a lockless way. To make it even
312fcd5affSSteven Rostedtlighter weight, when possible, it will only record the pointer to the format
322fcd5affSSteven Rostedtstring, and save the raw arguments into the buffer. The format and the
332fcd5affSSteven Rostedtarguments will be post processed when the ring buffer is read. This way the
342fcd5affSSteven Rostedttrace_printk() format conversions are not done during the hot path, where
352fcd5affSSteven Rostedtthe trace is being recorded.
362fcd5affSSteven Rostedt
372fcd5affSSteven Rostedttrace_printk() is meant only for debugging, and should never be added into
382fcd5affSSteven Rostedta subsystem of the kernel. If you need debugging traces, add trace events
392fcd5affSSteven Rostedtinstead. If a trace_printk() is found in the kernel, the following will
402fcd5affSSteven Rostedtappear in the dmesg::
412fcd5affSSteven Rostedt
422fcd5affSSteven Rostedt  **********************************************************
432fcd5affSSteven Rostedt  **   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **
442fcd5affSSteven Rostedt  **                                                      **
452fcd5affSSteven Rostedt  ** trace_printk() being used. Allocating extra memory.  **
462fcd5affSSteven Rostedt  **                                                      **
472fcd5affSSteven Rostedt  ** This means that this is a DEBUG kernel and it is     **
482fcd5affSSteven Rostedt  ** unsafe for production use.                           **
492fcd5affSSteven Rostedt  **                                                      **
502fcd5affSSteven Rostedt  ** If you see this message and you are not debugging    **
512fcd5affSSteven Rostedt  ** the kernel, report this immediately to your vendor!  **
522fcd5affSSteven Rostedt  **                                                      **
532fcd5affSSteven Rostedt  **   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **
542fcd5affSSteven Rostedt  **********************************************************
552fcd5affSSteven Rostedt
562fcd5affSSteven RostedtDebugging kernel crashes
572fcd5affSSteven Rostedt------------------------
582fcd5affSSteven RostedtThere is various methods of acquiring the state of the system when a kernel
592fcd5affSSteven Rostedtcrash occurs. This could be from the oops message in printk, or one could
602fcd5affSSteven Rostedtuse kexec/kdump. But these just show what happened at the time of the crash.
612fcd5affSSteven RostedtIt can be very useful in knowing what happened up to the point of the crash.
622fcd5affSSteven RostedtThe tracing ring buffer, by default, is a circular buffer than will
632fcd5affSSteven Rostedtoverwrite older events with newer ones. When a crash happens, the content of
642fcd5affSSteven Rostedtthe ring buffer will be all the events that lead up to the crash.
652fcd5affSSteven Rostedt
662fcd5affSSteven RostedtThere are several kernel command line parameters that can be used to help in
672fcd5affSSteven Rostedtthis. The first is "ftrace_dump_on_oops". This will dump the tracing ring
682fcd5affSSteven Rostedtbuffer when a oops occurs to the console. This can be useful if the console
692fcd5affSSteven Rostedtis being logged somewhere. If a serial console is used, it may be prudent to
702fcd5affSSteven Rostedtmake sure the ring buffer is relatively small, otherwise the dumping of the
712fcd5affSSteven Rostedtring buffer may take several minutes to hours to finish. Here's an example
722fcd5affSSteven Rostedtof the kernel command line::
732fcd5affSSteven Rostedt
742fcd5affSSteven Rostedt  ftrace_dump_on_oops trace_buf_size=50K
752fcd5affSSteven Rostedt
762fcd5affSSteven RostedtNote, the tracing buffer is made up of per CPU buffers where each of these
772fcd5affSSteven Rostedtbuffers is broken up into sub-buffers that are by default PAGE_SIZE. The
782fcd5affSSteven Rostedtabove trace_buf_size option above sets each of the per CPU buffers to 50K,
792fcd5affSSteven Rostedtso, on a machine with 8 CPUs, that's actually 400K total.
802fcd5affSSteven Rostedt
812fcd5affSSteven RostedtPersistent buffers across boots
822fcd5affSSteven Rostedt-------------------------------
832fcd5affSSteven RostedtIf the system memory allows it, the tracing ring buffer can be specified at
842fcd5affSSteven Rostedta specific location in memory. If the location is the same across boots and
852fcd5affSSteven Rostedtthe memory is not modified, the tracing buffer can be retrieved from the
862fcd5affSSteven Rostedtfollowing boot. There's two ways to reserve memory for the use of the ring
872fcd5affSSteven Rostedtbuffer.
882fcd5affSSteven Rostedt
892fcd5affSSteven RostedtThe more reliable way (on x86) is to reserve memory with the "memmap" kernel
902fcd5affSSteven Rostedtcommand line option and then use that memory for the trace_instance. This
912fcd5affSSteven Rostedtrequires a bit of knowledge of the physical memory layout of the system. The
922fcd5affSSteven Rostedtadvantage of using this method, is that the memory for the ring buffer will
932fcd5affSSteven Rostedtalways be the same::
942fcd5affSSteven Rostedt
952fcd5affSSteven Rostedt  memmap==12M$0x284500000 trace_instance=boot_map@0x284500000:12M
962fcd5affSSteven Rostedt
972fcd5affSSteven RostedtThe memmap above reserves 12 megabytes of memory at the physical memory
982fcd5affSSteven Rostedtlocation 0x284500000. Then the trace_instance option will create a trace
992fcd5affSSteven Rostedtinstance "boot_map" at that same location with the same amount of memory
1002fcd5affSSteven Rostedtreserved. As the ring buffer is broke up into per CPU buffers, the 12
1012fcd5affSSteven Rostedtmegabytes will be broken up evenly between those CPUs. If you have 8 CPUs,
1022fcd5affSSteven Rostedteach per CPU ring buffer will be 1.5 megabytes in size. Note, that also
1032fcd5affSSteven Rostedtincludes meta data, so the amount of memory actually used by the ring buffer
1042fcd5affSSteven Rostedtwill be slightly smaller.
1052fcd5affSSteven Rostedt
1062fcd5affSSteven RostedtAnother more generic but less robust way to allocate a ring buffer mapping
1072fcd5affSSteven Rostedtat boot is with the "reserve_mem" option::
1082fcd5affSSteven Rostedt
1092fcd5affSSteven Rostedt  reserve_mem=12M:4096:trace trace_instance=boot_map@trace
1102fcd5affSSteven Rostedt
1112fcd5affSSteven RostedtThe reserve_mem option above will find 12 megabytes that are available at
1122fcd5affSSteven Rostedtboot up, and align it by 4096 bytes. It will label this memory as "trace"
1132fcd5affSSteven Rostedtthat can be used by later command line options.
1142fcd5affSSteven Rostedt
1152fcd5affSSteven RostedtThe trace_instance option creates a "boot_map" instance and will use the
1162fcd5affSSteven Rostedtmemory reserved by reserve_mem that was labeled as "trace". This method is
1172fcd5affSSteven Rostedtmore generic but may not be as reliable. Due to KASLR, the memory reserved
1182fcd5affSSteven Rostedtby reserve_mem may not be located at the same location. If this happens,
1192fcd5affSSteven Rostedtthen the ring buffer will not be from the previous boot and will be reset.
1202fcd5affSSteven Rostedt
1212fcd5affSSteven RostedtSometimes, by using a larger alignment, it can keep KASLR from moving things
1222fcd5affSSteven Rostedtaround in such a way that it will move the location of the reserve_mem. By
1232fcd5affSSteven Rostedtusing a larger alignment, you may find better that the buffer is more
1242fcd5affSSteven Rostedtconsistent to where it is placed::
1252fcd5affSSteven Rostedt
1262fcd5affSSteven Rostedt  reserve_mem=12M:0x2000000:trace trace_instance=boot_map@trace
1272fcd5affSSteven Rostedt
1282fcd5affSSteven RostedtOn boot up, the memory reserved for the ring buffer is validated. It will go
1292fcd5affSSteven Rostedtthrough a series of tests to make sure that the ring buffer contains valid
1302fcd5affSSteven Rostedtdata. If it is, it will then set it up to be available to read from the
1312fcd5affSSteven Rostedtinstance. If it fails any of the tests, it will clear the entire ring buffer
1322fcd5affSSteven Rostedtand initialize it as new.
1332fcd5affSSteven Rostedt
1342fcd5affSSteven RostedtThe layout of this mapped memory may not be consistent from kernel to
1352fcd5affSSteven Rostedtkernel, so only the same kernel is guaranteed to work if the mapping is
1362fcd5affSSteven Rostedtpreserved. Switching to a different kernel version may find a different
1372fcd5affSSteven Rostedtlayout and mark the buffer as invalid.
1382fcd5affSSteven Rostedt
139*c44a14f2SSteven RostedtNB: Both the mapped address and size must be page aligned for the architecture.
140*c44a14f2SSteven Rostedt
1412fcd5affSSteven RostedtUsing trace_printk() in the boot instance
1422fcd5affSSteven Rostedt-----------------------------------------
1432fcd5affSSteven RostedtBy default, the content of trace_printk() goes into the top level tracing
1442fcd5affSSteven Rostedtinstance. But this instance is never preserved across boots. To have the
1452fcd5affSSteven Rostedttrace_printk() content, and some other internal tracing go to the preserved
1462fcd5affSSteven Rostedtbuffer (like dump stacks), either set the instance to be the trace_printk()
1472fcd5affSSteven Rostedtdestination from the kernel command line, or set it after boot up via the
1482fcd5affSSteven Rostedttrace_printk_dest option.
1492fcd5affSSteven Rostedt
1502fcd5affSSteven RostedtAfter boot up::
1512fcd5affSSteven Rostedt
1522fcd5affSSteven Rostedt  echo 1 > /sys/kernel/tracing/instances/boot_map/options/trace_printk_dest
1532fcd5affSSteven Rostedt
1542fcd5affSSteven RostedtFrom the kernel command line::
1552fcd5affSSteven Rostedt
1562fcd5affSSteven Rostedt  reserve_mem=12M:4096:trace trace_instance=boot_map^traceprintk^traceoff@trace
1572fcd5affSSteven Rostedt
1582fcd5affSSteven RostedtIf setting it from the kernel command line, it is recommended to also
1592fcd5affSSteven Rostedtdisable tracing with the "traceoff" flag, and enable tracing after boot up.
1602fcd5affSSteven RostedtOtherwise the trace from the most recent boot will be mixed with the trace
1612fcd5affSSteven Rostedtfrom the previous boot, and may make it confusing to read.
162