===============
 GPU Debugging
===============

GPUVM Debugging
===============

To aid in debugging GPU virtual memory related problems, the driver supports a
number of module parameters:

`vm_fault_stop` - If non-0, halt the GPU memory controller on a GPU page fault.

`vm_update_mode` - If non-0, use the CPU to update GPU page tables rather than
the GPU.
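Before chasing a fault it can help to confirm which values these parameters
currently have.  The following is a minimal sketch, not part of the driver or
any official tool.  It assumes the driver is loaded as the ``amdgpu`` module
and that the parameters are readable under ``/sys/module/amdgpu/parameters``;
the helper name ``read_vm_debug_params`` is hypothetical.

```python
from pathlib import Path

# Assumption: the driver is loaded as the "amdgpu" module and exposes its
# parameters read-only under this sysfs directory.
PARAM_DIR = Path("/sys/module/amdgpu/parameters")


def read_vm_debug_params(param_dir=PARAM_DIR):
    """Return the current values of the GPUVM debugging parameters.

    A parameter that cannot be read (driver not loaded, parameter not
    present) maps to None instead of raising.
    """
    values = {}
    for name in ("vm_fault_stop", "vm_update_mode"):
        try:
            values[name] = int((Path(param_dir) / name).read_text().strip())
        except (OSError, ValueError):
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in read_vm_debug_params().items():
        print(f"{name} = {value}")
```

Under the same assumption about the module name, the parameters would be set
when the module is loaded, for example with ``amdgpu.vm_fault_stop=1`` on the
kernel command line.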


Decoding a GPUVM Page Fault
===========================

If you see a GPU page fault in the kernel log, you can decode it to figure
out what is going wrong in your application.  A page fault in your kernel
log may look something like this:

::

 [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32777, for process glxinfo pid 2424 thread glxinfo:cs0 pid 2425)
   in page starting at address 0x0000800102800000 from IH client 0x1b (UTCL2)
 VM_L2_PROTECTION_FAULT_STATUS:0x00301030
 	Faulty UTCL2 client ID: TCP (0x8)
 	MORE_FAULTS: 0x0
 	WALKER_ERROR: 0x0
 	PERMISSION_FAULTS: 0x3
 	MAPPING_ERROR: 0x0
 	RW: 0x0

First comes the memory hub, either gfxhub or mmhub.  gfxhub is the memory
hub used for graphics, compute, and, on some chips, SDMA.  mmhub is the
memory hub used for multimedia and, on some chips, SDMA.

Next come the vmid and pasid.  If the vmid is 0, this fault was likely
caused by the kernel driver or firmware.  If the vmid is non-0, it is generally
a fault in a user application.  The pasid is used to link a vmid to a system
process id.  If the process is active when the fault happens, the process
information will be printed.

The GPU virtual address that caused the fault comes next.

The client ID indicates the GPU block that caused the fault.
Some common client IDs:

- CB/DB: The color/depth backend of the graphics pipe
- CPF: Command Processor Frontend
- CPC: Command Processor Compute
- CPG: Command Processor Graphics
- TCP/SQC/SQG: Shaders
- SDMA: SDMA engines
- VCN: Video encode/decode engines
- JPEG: JPEG engines

PERMISSION_FAULTS describes which faults were encountered:

- bit 0: the PTE was not valid
- bit 1: the PTE read bit was not set
- bit 2: the PTE write bit was not set
- bit 3: the PTE execute bit was not set
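The bit list above can be applied mechanically.  The following is a small
illustrative sketch that expands a PERMISSION_FAULTS value into the reasons it
encodes.  The bit meanings are taken from the list above; the helper name
``decode_permission_faults`` is hypothetical.

```python
# Bit meanings copied from the PERMISSION_FAULTS list above.
PERMISSION_FAULT_BITS = {
    0: "the PTE was not valid",
    1: "the PTE read bit was not set",
    2: "the PTE write bit was not set",
    3: "the PTE execute bit was not set",
}


def decode_permission_faults(value):
    """Return the description of every bit set in PERMISSION_FAULTS."""
    return [
        text
        for bit, text in PERMISSION_FAULT_BITS.items()
        if value & (1 << bit)
    ]


if __name__ == "__main__":
    # 0x3 has bits 0 and 1 set.
    for reason in decode_permission_faults(0x3):
        print(reason)
```

For the value 0x3 this reports that the PTE was not valid and that the PTE
read bit was not set.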

Finally, RW indicates whether the access was a read (0) or a write (1).

In the example above, a shader (client ID = TCP) generated a read (RW = 0x0) to
an invalid page (PERMISSION_FAULTS = 0x3) at GPU virtual address
0x0000800102800000.  The user can then inspect their shader code and resource
descriptor state to determine what caused the GPU page fault.
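The same reading of the log can be scripted.  The following is a rough sketch
that pulls the fields discussed above out of a fault message and prints a
one-line summary.  It assumes the log text has the same shape as the sample in
this document; the exact format may differ between kernel versions and chips,
and the names ``PATTERNS``, ``parse_fault`` and ``summarize`` are hypothetical.

```python
import re

# The sample fault from this document.
SAMPLE = """\
 [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32777, for process glxinfo pid 2424 thread glxinfo:cs0 pid 2425)
   in page starting at address 0x0000800102800000 from IH client 0x1b (UTCL2)
 VM_L2_PROTECTION_FAULT_STATUS:0x00301030
 \tFaulty UTCL2 client ID: TCP (0x8)
 \tMORE_FAULTS: 0x0
 \tWALKER_ERROR: 0x0
 \tPERMISSION_FAULTS: 0x3
 \tMAPPING_ERROR: 0x0
 \tRW: 0x0
"""

# Assumption: these patterns mirror the sample layout only.
PATTERNS = {
    "hub": r"\[(\w+)\]",
    "vmid": r"vmid:(\d+)",
    "pasid": r"pasid:(\d+)",
    "process": r"for process (.+?) pid \d+",
    "address": r"address (0x[0-9a-fA-F]+)",
    "client": r"client ID: (\w+)",
    "permission_faults": r"PERMISSION_FAULTS: (0x[0-9a-fA-F]+)",
    "rw": r"\bRW: (0x[0-9a-fA-F]+)",
}


def parse_fault(text):
    """Extract the fields of one fault message; missing fields are None."""
    fault = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        fault[name] = match.group(1) if match else None
    for name in ("vmid", "pasid"):
        if fault[name] is not None:
            fault[name] = int(fault[name])
    for name in ("address", "permission_faults", "rw"):
        if fault[name] is not None:
            fault[name] = int(fault[name], 16)
    return fault


def summarize(fault):
    """Describe a fully parsed fault using the rules from the text above."""
    if fault["vmid"] == 0:
        origin = "kernel driver or firmware"
    else:
        origin = "user application"
    access = "write" if fault["rw"] else "read"
    return (
        f"{fault['client']} {access} at {fault['address']:#018x} "
        f"via {fault['hub']} (vmid {fault['vmid']}, likely {origin})"
    )


if __name__ == "__main__":
    print(summarize(parse_fault(SAMPLE)))
```

``summarize`` expects every field to be present, so a message that deviates
from the sample layout should be checked for ``None`` values first.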

UMR
===

`umr <https://gitlab.freedesktop.org/tomstdenis/umr>`_ is a general purpose
GPU debugging and diagnostics tool.  Please see the umr
`documentation <https://umr.readthedocs.io/en/main/>`_ for more information
about its capabilities.