===============
 GPU Debugging
===============

GPUVM Debugging
===============

To aid in debugging GPU virtual memory related problems, the driver supports a
number of optional module parameters:

`vm_fault_stop` - If non-0, halt the GPU memory controller on a GPU page fault.

`vm_update_mode` - If non-0, use the CPU to update GPU page tables rather than
the GPU.


Decoding a GPUVM Page Fault
===========================

If you see a GPU page fault in the kernel log, you can decode it to figure
out what is going wrong in your application.  A page fault in your kernel
log may look something like this:

::

 [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32777, for process glxinfo pid 2424 thread glxinfo:cs0 pid 2425)
   in page starting at address 0x0000800102800000 from IH client 0x1b (UTCL2)
 VM_L2_PROTECTION_FAULT_STATUS:0x00301030
   Faulty UTCL2 client ID: TCP (0x8)
   MORE_FAULTS: 0x0
   WALKER_ERROR: 0x0
   PERMISSION_FAULTS: 0x3
   MAPPING_ERROR: 0x0
   RW: 0x0

The first field is the memory hub, either gfxhub or mmhub.  gfxhub is the memory
hub used for graphics, compute, and sdma on some chips.
mmhub is the memory hub used for multimedia and sdma on some chips.

Next you have the vmid and pasid.  If the vmid is 0, this fault was likely
caused by the kernel driver or firmware.  If the vmid is non-0, it is generally
a fault in a user application.  The pasid is used to link a vmid to a system
process id.  If the process is active when the fault happens, the process
information will be printed.

The GPU virtual address that caused the fault comes next.

The client ID indicates the GPU block that caused the fault.
Some common client IDs:

- CB/DB: The color/depth backend of the graphics pipe
- CPF: Command Processor Frontend
- CPC: Command Processor Compute
- CPG: Command Processor Graphics
- TCP/SQC/SQG: Shaders
- SDMA: SDMA engines
- VCN: Video encode/decode engines
- JPEG: JPEG engines

PERMISSION_FAULTS describes which faults were encountered:

- bit 0: the PTE was not valid
- bit 1: the PTE read bit was not set
- bit 2: the PTE write bit was not set
- bit 3: the PTE execute bit was not set

Finally, RW indicates whether the access was a read (0) or a write (1).
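
The field descriptions above can be applied mechanically. The following is a
minimal Python sketch, not part of the driver or of umr, that reads the decoded
fields out of a fault message and names the set PERMISSION_FAULTS bits. The
function names (`decode_permission_faults`, `summarize_fault`) and the regular
expressions are assumptions made for illustration; they match the example
message in this document, and the exact log format may differ between kernel
versions and chips.

```python
import re

# Bit meanings for PERMISSION_FAULTS, copied from the list above.
PERMISSION_FAULT_BITS = {
    0: "the PTE was not valid",
    1: "the PTE read bit was not set",
    2: "the PTE write bit was not set",
    3: "the PTE execute bit was not set",
}

# The example fault from this document, trimmed to the lines the sketch reads.
SAMPLE_FAULT = """\
[gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32777, for process glxinfo pid 2424 thread glxinfo:cs0 pid 2425)
  in page starting at address 0x0000800102800000 from IH client 0x1b (UTCL2)
Faulty UTCL2 client ID: TCP (0x8)
PERMISSION_FAULTS: 0x3
RW: 0x0
"""


def decode_permission_faults(value):
    """Return the description of every bit set in a PERMISSION_FAULTS value."""
    return [text for bit, text in PERMISSION_FAULT_BITS.items() if value & (1 << bit)]


def _find(pattern, text):
    """Return the first captured group, or raise ValueError if it is absent."""
    match = re.search(pattern, text)
    if match is None:
        raise ValueError(f"field not found: {pattern}")
    return match.group(1)


def summarize_fault(text):
    """Hypothetical helper: summarize a fault message as a dict."""
    vmid = int(_find(r"vmid:(\d+)", text))
    rw = int(_find(r"\bRW:\s*(0x[0-9a-fA-F]+)", text), 16)
    faults = int(_find(r"PERMISSION_FAULTS:\s*(0x[0-9a-fA-F]+)", text), 16)
    return {
        # vmid 0 points at the kernel driver or firmware, anything else at userspace.
        "origin": "likely the kernel driver or firmware" if vmid == 0
                  else "generally a user application",
        "address": int(_find(r"address (0x[0-9a-fA-F]+)", text), 16),
        "client": _find(r"client ID: (\w+)", text),
        "access": "write" if rw else "read",
        "faults": decode_permission_faults(faults),
    }


if __name__ == "__main__":
    for key, value in summarize_fault(SAMPLE_FAULT).items():
        print(f"{key}: {value}")
```

The sketch only reads the fields the driver has already decoded in the log
message; it does not attempt to unpack the raw VM_L2_PROTECTION_FAULT_STATUS
register value, whose layout is not described here.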

In the example fault above, a shader (client ID = TCP) generated a read
(RW = 0x0) to an invalid page (PERMISSION_FAULTS = 0x3) at GPU virtual address
0x0000800102800000.  The user can then inspect their shader code and resource
descriptor state to determine what caused the GPU page fault.

UMR
===

`umr <https://gitlab.freedesktop.org/tomstdenis/umr>`_ is a general purpose
GPU debugging and diagnostics tool.  Please see the umr
`documentation <https://umr.readthedocs.io/en/main/>`_ for more information
about its capabilities.