|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13 |
|
| #
23b64523 |
| 14-Jan-2025 |
Philip Yang <[email protected]> |
drm/amdgpu: Unlocked unmap only clear page table leaves
SVM migration unmap pages from GPU and then update mapping to GPU to recover page fault. Currently unmap clears the PDE entry for range length
drm/amdgpu: Unlocked unmap only clear page table leaves
SVM migration unmap pages from GPU and then update mapping to GPU to recover page fault. Currently unmap clears the PDE entry for range length >= huge page and free PTB bo, update mapping to alloc new PT bo. There is race bug that the freed entry bo maybe still on the pt_free list, reused when updating mapping and then freed, leave invalid PDE entry and cause GPU page fault.
By setting the update to clear only one PDE entry or clear PTB, to avoid unmap to free PTE bo. This fixes the race bug and improve the unmap and map to GPU performance. Update mapping to huge page will still free the PTB bo.
With this change, the vm->pt_freed list and work is not needed. Add WARN_ON(unlocked) in amdgpu_vm_pt_free_dfs to catch if unmap to free the PTB.
Signed-off-by: Philip Yang <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4 |
|
| #
74ef9527 |
| 19-Dec-2024 |
Yunxiang Li <[email protected]> |
drm/amdgpu: track bo memory stats at runtime
Before, every time fdinfo is queried we try to lock all the BOs in the VM and calculate memory usage from scratch. This works okay if the fdinfo is rarel
drm/amdgpu: track bo memory stats at runtime
Before, every time fdinfo is queried we try to lock all the BOs in the VM and calculate memory usage from scratch. This works okay if the fdinfo is rarely read and the VMs don't have a ton of BOs. If either of these conditions is not true, we get a massive performance hit.
In this new revision, we track the BOs as they change states. This way when the fdinfo is queried we only need to take the status lock and copy out the usage stats with minimal impact to the runtime performance. With this new approach however, we would no longer be able to track active buffers.
Signed-off-by: Yunxiang Li <[email protected]> Reviewed-by: Christian König <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] Signed-off-by: Christian König <[email protected]>
show more ...
|
|
Revision tags: v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6 |
|
| #
7181faaa |
| 27-Aug-2024 |
Christian König <[email protected]> |
drm/amdgpu: nuke the VM PD/PT shadow handling
This was only used as workaround for recovering the page tables after VRAM was lost and is no longer necessary after the function amdgpu_vm_bo_reset_sta
drm/amdgpu: nuke the VM PD/PT shadow handling
This was only used as workaround for recovering the page tables after VRAM was lost and is no longer necessary after the function amdgpu_vm_bo_reset_state_machine() started to do the same.
Compute never used shadows either, so the only proplematic case left is SVM and that is most likely not recoverable in any way when VRAM is lost.
Signed-off-by: Christian König <[email protected]> Acked-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v6.11-rc5 |
|
| #
4da5a95b |
| 20-Aug-2024 |
Christian König <[email protected]> |
drm/amdgpu: re-work VM syncing
Rework how VM operations synchronize to submissions. Provide an amdgpu_sync container to the backends instead of an reservation object and fill in the amdgpu_sync obje
drm/amdgpu: re-work VM syncing
Rework how VM operations synchronize to submissions. Provide an amdgpu_sync container to the backends instead of an reservation object and fill in the amdgpu_sync object in the higher layers of the code.
No intended functional change, just prepares for upcomming changes.
Signed-off-by: Christian König <[email protected]> Reviewed-by: Friedrich Vock <[email protected]> Acked-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1 |
|
| #
511a623f |
| 23-May-2024 |
Jesse Zhang <[email protected]> |
drm/amdgpu: fix dereference null return value for the function amdgpu_vm_pt_parent
The pointer parent may be NULLed by the function amdgpu_vm_pt_parent. To make the code more robust, check the point
drm/amdgpu: fix dereference null return value for the function amdgpu_vm_pt_parent
The pointer parent may be NULLed by the function amdgpu_vm_pt_parent. To make the code more robust, check the pointer parent.
Signed-off-by: Jesse Zhang <[email protected]> Suggested-by: Christian König <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7, v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7, v6.4-rc6, v6.4-rc5, v6.4-rc4, v6.4-rc3, v6.4-rc2, v6.4-rc1, v6.3, v6.3-rc7, v6.3-rc6, v6.3-rc5, v6.3-rc4, v6.3-rc3, v6.3-rc2 |
|
| #
980a0a94 |
| 08-Mar-2023 |
Hawking Zhang <[email protected]> |
drm/amdgpu: support gfx v12 specific pte/pde fields
Add gfx v12 pte/pde support to gmc common helper.
v2: squash in fixes (Alex)
Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: L
drm/amdgpu: support gfx v12 specific pte/pde fields
Add gfx v12 pte/pde support to gmc common helper.
v2: squash in fixes (Alex)
Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Likun Gao <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
a0cf3654 |
| 23-May-2024 |
Jesse Zhang <[email protected]> |
drm/amdgpu: fix dereference null return value for the function amdgpu_vm_pt_parent
The pointer parent may be NULLed by the function amdgpu_vm_pt_parent. To make the code more robust, check the point
drm/amdgpu: fix dereference null return value for the function amdgpu_vm_pt_parent
The pointer parent may be NULLed by the function amdgpu_vm_pt_parent. To make the code more robust, check the pointer parent.
Signed-off-by: Jesse Zhang <[email protected]> Suggested-by: Christian König <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
f88a7dd0 |
| 21-Mar-2024 |
Shashank Sharma <[email protected]> |
drm/amdgpu: Add a NULL check for freeing root PT
This patch adds a NULL check to fix this crash reported during the freeing of root PT entry:
BUG: unable to handle page fault for address: ffffc900
drm/amdgpu: Add a NULL check for freeing root PT
This patch adds a NULL check to fix this crash reported during the freeing of root PT entry:
BUG: unable to handle page fault for address: ffffc9002d637aa0 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page RIP: 0010:amdgpu_vm_pt_free+0x66/0xe0 [amdgpu] PKRU: 55555554 Call Trace: <TASK> amdgpu_vm_pt_free_root+0x60/0xa0 [amdgpu] amdgpu_vm_fini+0x2cb/0x5d0 [amdgpu] ? amdgpu_ctx_mgr_entity_fini+0x53/0x1c0 [amdgpu] amdgpu_driver_postclose_kms+0x191/0x2d0 [amdgpu] drm_file_free.part.0+0x1e5/0x260 [drm]
Cc: Christian König <[email protected]> Cc: Alex Deucher <[email protected]> Cc: Felix Kuehling <[email protected]> Cc: Rajneesh Bhardwaj <[email protected]> Acked-by: Christian König <[email protected]> Signed-off-by: Shashank Sharma <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
b6c4f90b |
| 18-Mar-2024 |
Shashank Sharma <[email protected]> |
drm/amdgpu: sync page table freeing with tlb flush
The idea behind this patch is to delay the freeing of PT entry objects until the TLB flush is done.
This patch: - Adds a tlb_flush_waitlist in amd
drm/amdgpu: sync page table freeing with tlb flush
The idea behind this patch is to delay the freeing of PT entry objects until the TLB flush is done.
This patch: - Adds a tlb_flush_waitlist in amdgpu_vm_update_params which will keep the objects that need to be freed after tlb_flush. - Adds PT entries in this list in amdgpu_vm_ptes_update after finding the PT entry. - Changes functionality of amdgpu_vm_pt_free_dfs from (df_search + free) to simply freeing of the BOs, also renames it to amdgpu_vm_pt_free_list to reflect this same. - Exports function amdgpu_vm_pt_free_list to be called directly. - Calls amdgpu_vm_pt_free_list directly from amdgpu_vm_update_range.
V2: rebase V4: Addressed review comments from Christian - add only locked PTEs entries in TLB flush waitlist. - do not create a separate function for list flush. - do not create a new lock for TLB flush. - there is no need to wait on tlb_flush_fence exclusively.
V5: Addressed review comments from Christian - change the amdgpu_vm_pt_free_dfs's functionality to simple freeing of the objects and rename it. - add all the PTE objects in params->tlb_flush_waitlist - let amdgpu_vm_pt_free_root handle the freeing of BOs independently - call amdgpu_vm_pt_free_list directly
V6: Rebase V7: Rebase V8: Added a NULL check to fix this backtrace issue: [ 415.351447] BUG: kernel NULL pointer dereference, address: 0000000000000008 [ 415.359245] #PF: supervisor write access in kernel mode [ 415.365081] #PF: error_code(0x0002) - not-present page [ 415.370817] PGD 101259067 P4D 101259067 PUD 10125a067 PMD 0 [ 415.377140] Oops: 0002 [#1] PREEMPT SMP NOPTI [ 415.382004] CPU: 0 PID: 25481 Comm: test_with_MPI.e Tainted: G OE 5.18.2-mi300-build-140423-ubuntu-22.04+ #24 [ 415.394437] Hardware name: AMD Corporation Sh51p/Sh51p, BIOS RMO1001AS 02/21/2024 [ 415.402797] RIP: 0010:amdgpu_vm_ptes_update+0x6fd/0xa10 [amdgpu] [ 415.409648] Code: 4c 89 ff 4d 8d 66 30 e8 f1 ed ff ff 48 85 db 74 42 48 39 5d a0 74 40 48 8b 53 20 48 8b 4b 18 48 8d 43 18 48 8d 75 b0 4c 89 ff <48 > 89 51 08 48 89 0a 49 8b 56 30 48 89 42 08 48 89 53 18 4c 89 63 [ 415.430621] RSP: 0018:ffffc9000401f990 EFLAGS: 00010287 [ 415.436456] RAX: ffff888147bb82f0 RBX: ffff888147bb82d8 RCX: 0000000000000000 [ 415.444426] RDX: 0000000000000000 RSI: ffffc9000401fa30 RDI: ffff888161f80000 [ 415.452397] RBP: ffffc9000401fa80 R08: 0000000000000000 R09: ffffc9000401fa00 [ 415.460368] R10: 00000007f0cc0000 R11: 00000007f0c85000 R12: ffffc9000401fb20 [ 415.468340] R13: 00000007f0d00000 R14: ffffc9000401faf0 R15: ffff888161f80000 [ 415.476312] FS: 00007f132ff89840(0000) GS:ffff889f87c00000(0000) knlGS:0000000000000000 [ 415.485350] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 415.491767] CR2: 0000000000000008 CR3: 0000000161d46003 CR4: 0000000000770ef0 [ 415.499738] PKRU: 55555554 [ 415.502750] Call Trace: [ 415.505482] <TASK> [ 415.507825] amdgpu_vm_update_range+0x32a/0x880 [amdgpu] [ 415.513869] amdgpu_vm_clear_freed+0x117/0x250 [amdgpu] [ 415.519814] amdgpu_amdkfd_gpuvm_unmap_memory_from_gpu+0x18c/0x250 [amdgpu] [ 415.527729] kfd_ioctl_unmap_memory_from_gpu+0xed/0x340 [amdgpu] [ 415.534551] kfd_ioctl+0x3b6/0x510 [amdgpu]
V9: Addressed review comments from Christian - No NULL check reqd for root PT freeing - Free PT list regardless of needs_flush - Move adding BOs in list in a separate function
V10: Added Christian's RB V11: squash in list fix
Cc: Christian König <[email protected]> Cc: Alex Deucher <[email protected]> Cc: Felix Kuehling <[email protected]> Cc: Rajneesh Bhardwaj <[email protected]> Acked-by: Felix Kuehling <[email protected]> Acked-by: Rajneesh Bhardwaj <[email protected]> Reviewed-by: Christian König <[email protected]> Tested-by: Rajneesh Bhardwaj <[email protected]> Signed-off-by: Shashank Sharma <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
d8a3f0a0 |
| 18-Mar-2024 |
Christian Koenig <[email protected]> |
drm/amdgpu: implement TLB flush fence
The problem is that when (for example) 4k pages are replaced with a single 2M page we need to wait for change to be flushed out by invalidating the TLB before t
drm/amdgpu: implement TLB flush fence
The problem is that when (for example) 4k pages are replaced with a single 2M page we need to wait for change to be flushed out by invalidating the TLB before the PT can be freed.
Solve this by moving the TLB flush into a DMA-fence object which can be used to delay the freeing of the PT BOs until it is signaled.
V2: (Shashank) - rebase - set dma_fence_error only in case of error - add tlb_flush fence only when PT/PD BO is locked (Felix) - use vm->pasid when f is NULL (Mukul)
V4: - add a wait for (f->dependency) in tlb_fence_work (Christian) - move the misplaced fence_create call to the end (Philip)
V5: - free the f->dependency properly
V6: (Shashank) - light code movement, moved all the clean-up in previous patch - introduce params.needs_flush and its usage in this patch - rebase without TLB HW sequence patch
V7: - Keep the vm->last_update_fence and tlb_cb code until we can fix the HW sequencing (Christian) - Move all the tlb_fence related code in a separate function so that its easier to read and review
V9: Addressed review comments from Christian - start PT update only when we have callback memory allocated
V10: - handle device unlock in OOM case (Christian, Mukul) - added Christian's R-B
Cc: Christian Koenig <[email protected]> Cc: Felix Kuehling <[email protected]> Cc: Rajneesh Bhardwaj <[email protected]> Cc: Alex Deucher <[email protected]> Acked-by: Felix Kuehling <[email protected]> Acked-by: Rajneesh Bhardwaj <[email protected]> Tested-by: Rajneesh Bhardwaj <[email protected]> Reviewed-by: Shashank Sharma <[email protected]> Reviewed-by: Christian Koenig <[email protected]> Signed-off-by: Christian Koenig <[email protected]> Signed-off-by: Shashank Sharma <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
bb8863cc |
| 05-Mar-2024 |
Jesse Zhang <[email protected]> |
drm/amdgpu: remove unused code
Remove the unused function - amdgpu_vm_pt_is_root_clean and remove the impossible condition
v1: entries == 0 is not possible any more, so this condition could pro
drm/amdgpu: remove unused code
Remove the unused function - amdgpu_vm_pt_is_root_clean and remove the impossible condition
v1: entries == 0 is not possible any more, so this condition could probably be removed (Felix)
Signed-off-by: Jesse Zhang <[email protected]> Suggested-by:Felix Kuehling <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
b8f67b9d |
| 18-Jan-2024 |
Shashank Sharma <[email protected]> |
drm/amdgpu: change vm->task_info handling
This patch changes the handling and lifecycle of vm->task_info object. The major changes are: - vm->task_info is a dynamically allocated ptr now, and its ua
drm/amdgpu: change vm->task_info handling
This patch changes the handling and lifecycle of vm->task_info object. The major changes are: - vm->task_info is a dynamically allocated ptr now, and its uasge is reference counted. - introducing two new helper funcs for task_info lifecycle management - amdgpu_vm_get_task_info: reference counts up task_info before returning this info - amdgpu_vm_put_task_info: reference counts down task_info - last put to task_info() frees task_info from the vm.
This patch also does logistical changes required for existing usage of vm->task_info.
V2: Do not block all the prints when task_info not found (Felix)
V3: Fixed review comments from Felix - Fix wrong indentation - No debug message for -ENOMEM - Add NULL check for task_info - Do not duplicate the debug messages (ti vs no ti) - Get first reference of task_info in vm_init(), put last in vm_fini()
V4: Fixed review comments from Felix - fix double reference increment in create_task_info - change amdgpu_vm_get_task_info_pasid - additional changes in amdgpu_gem.c while porting
Cc: Christian Koenig <[email protected]> Cc: Alex Deucher <[email protected]> Cc: Felix Kuehling <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Shashank Sharma <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
feb13f52 |
| 29-Feb-2024 |
Jesse Zhang <[email protected]> |
Revert "drm/amdgpu: remove vm sanity check from amdgpu_vm_make_compute" for Raven
fix the issue: "amdgpu: Failed to create process VM object".
[Why]when amdgpu initialized, seq64 do mampping and up
Revert "drm/amdgpu: remove vm sanity check from amdgpu_vm_make_compute" for Raven
fix the issue: "amdgpu: Failed to create process VM object".
[Why]when amdgpu initialized, seq64 do mampping and update bo mapping in vm page table. But when clifo run. It also initializes a vm for a process device through the function kfd_process_device_init_vm and ensure the root PD is clean through the function amdgpu_vm_pt_is_root_clean. So they have a conflict, and clinfo always failed.
v1: - remove all the pte_supports_ats stuff from the amdgpu_vm code (Felix)
Signed-off-by: Jesse Zhang <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
ceb9a321 |
| 08-Dec-2023 |
Christian König <[email protected]> |
drm/amdgpu: fix tear down order in amdgpu_vm_pt_free
When freeing PD/PT with shadows it can happen that the shadow destruction races with detaching the PD/PT from the VM causing a NULL pointer deref
drm/amdgpu: fix tear down order in amdgpu_vm_pt_free
When freeing PD/PT with shadows it can happen that the shadow destruction races with detaching the PD/PT from the VM causing a NULL pointer dereference in the invalidation code.
Fix this by detaching the the PD/PT from the VM first and then freeing the shadow instead.
Signed-off-by: Christian König <[email protected]> Fixes: https://gitlab.freedesktop.org/drm/amd/-/issues/2867 Cc: <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
683b8c7e |
| 08-Dec-2023 |
Christian König <[email protected]> |
drm/amdgpu: fix tear down order in amdgpu_vm_pt_free
When freeing PD/PT with shadows it can happen that the shadow destruction races with detaching the PD/PT from the VM causing a NULL pointer deref
drm/amdgpu: fix tear down order in amdgpu_vm_pt_free
When freeing PD/PT with shadows it can happen that the shadow destruction races with detaching the PD/PT from the VM causing a NULL pointer dereference in the invalidation code.
Fix this by detaching the the PD/PT from the VM first and then freeing the shadow instead.
Signed-off-by: Christian König <[email protected]> Fixes: https://gitlab.freedesktop.org/drm/amd/-/issues/2867 Cc: <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
142262a1 |
| 12-Oct-2023 |
David Francis <[email protected]> |
drm/amdgpu: Add EXT_COHERENT support for APU and NUMA systems
On gfx943 APU, EXT_COHERENT should give MTYPE_CC for local and MTYPE_UC for nonlocal memory.
On NUMA systems, local memory gets the loc
drm/amdgpu: Add EXT_COHERENT support for APU and NUMA systems
On gfx943 APU, EXT_COHERENT should give MTYPE_CC for local and MTYPE_UC for nonlocal memory.
On NUMA systems, local memory gets the local mtype, set by an override callback. If EXT_COHERENT is set, memory will be set as MTYPE_UC by default, with local memory MTYPE_CC.
Add an option in the override function for this case, and add a check to ensure it is not used on UNCACHED memory.
V2: Combined APU and NUMA code into one patch V3: Fixed a potential nullptr in amdgpu_vm_bo_update
Signed-off-by: David Francis <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
fdac8909 |
| 28-Sep-2023 |
Philip Yang <[email protected]> |
drm/amdgpu: ratelimited override pte flags messages
Use ratelimited version of dev_dbg to avoid flooding dmesg log. No functional change.
Signed-off-by: Philip Yang <[email protected]> Reviewed-b
drm/amdgpu: ratelimited override pte flags messages
Use ratelimited version of dev_dbg to avoid flooding dmesg log. No functional change.
Signed-off-by: Philip Yang <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
f135b0fc |
| 20-Jul-2023 |
Yang Li <[email protected]> |
drm/amdgpu: Fix one kernel-doc comment
Use colon to separate parameter name from their specific meaning. silence the warning:
drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c:793: warning: Function parame
drm/amdgpu: Fix one kernel-doc comment
Use colon to separate parameter name from their specific meaning. silence the warning:
drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c:793: warning: Function parameter or member 'adev' not described in 'amdgpu_vm_pte_update_noretry_flags'
Reviewed-by: Randy Dunlap <[email protected]> Signed-off-by: Yang Li <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
8e781271 |
| 13-Jul-2023 |
Guchun Chen <[email protected]> |
drm/amdgpu/vm: use the same xcp_id from root PD
Other PDs/PTs allocation should just use the same xcp_id as that stored in root PD.
Suggested-by: Christian König <[email protected]> Signed-o
drm/amdgpu/vm: use the same xcp_id from root PD
Other PDs/PTs allocation should just use the same xcp_id as that stored in root PD.
Suggested-by: Christian König <[email protected]> Signed-off-by: Guchun Chen <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
8ecee4cb |
| 13-Jul-2023 |
Guchun Chen <[email protected]> |
drm/amdgpu: fix slab-out-of-bounds issue in amdgpu_vm_pt_create
Recent code set xcp_id stored from file private data when opening device to amdgpu bo for accounting memory usage etc, but not all VMs
drm/amdgpu: fix slab-out-of-bounds issue in amdgpu_vm_pt_create
Recent code set xcp_id stored from file private data when opening device to amdgpu bo for accounting memory usage etc, but not all VMs are attached to this fpriv structure like the vm cases in amdgpu_mes_self_test, otherwise, KASAN will complain below out of bound access. And more importantly, VM code should not touch fpriv structure, so drop fpriv code handling from amdgpu_vm_pt.
[ 77.292314] BUG: KASAN: slab-out-of-bounds in amdgpu_vm_pt_create+0x17e/0x4b0 [amdgpu] [ 77.293845] Read of size 4 at addr ffff888102c48a48 by task modprobe/1069 [ 77.294146] Call Trace: [ 77.294178] <TASK> [ 77.294208] dump_stack_lvl+0x49/0x63 [ 77.294260] print_report+0x16f/0x4a6 [ 77.294307] ? amdgpu_vm_pt_create+0x17e/0x4b0 [amdgpu] [ 77.295979] ? kasan_complete_mode_report_info+0x3c/0x200 [ 77.296057] ? amdgpu_vm_pt_create+0x17e/0x4b0 [amdgpu] [ 77.297556] kasan_report+0xb4/0x130 [ 77.297609] ? amdgpu_vm_pt_create+0x17e/0x4b0 [amdgpu] [ 77.299202] __asan_load4+0x6f/0x90 [ 77.299272] amdgpu_vm_pt_create+0x17e/0x4b0 [amdgpu] [ 77.300796] ? amdgpu_init+0x6e/0x1000 [amdgpu] [ 77.302222] ? amdgpu_vm_pt_clear+0x750/0x750 [amdgpu] [ 77.303721] ? preempt_count_sub+0x18/0xc0 [ 77.303786] amdgpu_vm_init+0x39e/0x870 [amdgpu] [ 77.305186] ? amdgpu_vm_wait_idle+0x90/0x90 [amdgpu] [ 77.306683] ? kasan_set_track+0x25/0x30 [ 77.306737] ? kasan_save_alloc_info+0x1b/0x30 [ 77.306795] ? __kasan_kmalloc+0x87/0xa0 [ 77.306852] amdgpu_mes_self_test+0x169/0x620 [amdgpu]
v2: without specifying xcp partition for PD/PT bo, the xcp id is -1.
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2686 Fixes: 3ebfd221c1a8 ("drm/amdkfd: Store xcp partition id to amdgpu bo") Signed-off-by: Guchun Chen <[email protected]> Tested-by: Mikhail Gavrilov <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
e379b5e7 |
| 13-Jul-2023 |
Guchun Chen <[email protected]> |
drm/amdgpu/vm: use the same xcp_id from root PD
Other PDs/PTs allocation should just use the same xcp_id as that stored in root PD.
Suggested-by: Christian König <[email protected]> Signed-o
drm/amdgpu/vm: use the same xcp_id from root PD
Other PDs/PTs allocation should just use the same xcp_id as that stored in root PD.
Suggested-by: Christian König <[email protected]> Signed-off-by: Guchun Chen <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
5003ca63 |
| 13-Jul-2023 |
Guchun Chen <[email protected]> |
drm/amdgpu: fix slab-out-of-bounds issue in amdgpu_vm_pt_create
Recent code set xcp_id stored from file private data when opening device to amdgpu bo for accounting memory usage etc, but not all VMs
drm/amdgpu: fix slab-out-of-bounds issue in amdgpu_vm_pt_create
Recent code set xcp_id stored from file private data when opening device to amdgpu bo for accounting memory usage etc, but not all VMs are attached to this fpriv structure like the vm cases in amdgpu_mes_self_test, otherwise, KASAN will complain below out of bound access. And more importantly, VM code should not touch fpriv structure, so drop fpriv code handling from amdgpu_vm_pt.
[ 77.292314] BUG: KASAN: slab-out-of-bounds in amdgpu_vm_pt_create+0x17e/0x4b0 [amdgpu] [ 77.293845] Read of size 4 at addr ffff888102c48a48 by task modprobe/1069 [ 77.294146] Call Trace: [ 77.294178] <TASK> [ 77.294208] dump_stack_lvl+0x49/0x63 [ 77.294260] print_report+0x16f/0x4a6 [ 77.294307] ? amdgpu_vm_pt_create+0x17e/0x4b0 [amdgpu] [ 77.295979] ? kasan_complete_mode_report_info+0x3c/0x200 [ 77.296057] ? amdgpu_vm_pt_create+0x17e/0x4b0 [amdgpu] [ 77.297556] kasan_report+0xb4/0x130 [ 77.297609] ? amdgpu_vm_pt_create+0x17e/0x4b0 [amdgpu] [ 77.299202] __asan_load4+0x6f/0x90 [ 77.299272] amdgpu_vm_pt_create+0x17e/0x4b0 [amdgpu] [ 77.300796] ? amdgpu_init+0x6e/0x1000 [amdgpu] [ 77.302222] ? amdgpu_vm_pt_clear+0x750/0x750 [amdgpu] [ 77.303721] ? preempt_count_sub+0x18/0xc0 [ 77.303786] amdgpu_vm_init+0x39e/0x870 [amdgpu] [ 77.305186] ? amdgpu_vm_wait_idle+0x90/0x90 [amdgpu] [ 77.306683] ? kasan_set_track+0x25/0x30 [ 77.306737] ? kasan_save_alloc_info+0x1b/0x30 [ 77.306795] ? __kasan_kmalloc+0x87/0xa0 [ 77.306852] amdgpu_mes_self_test+0x169/0x620 [amdgpu]
v2: without specifying xcp partition for PD/PT bo, the xcp id is -1.
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2686 Fixes: 3ebfd221c1a8 ("drm/amdkfd: Store xcp partition id to amdgpu bo") Signed-off-by: Guchun Chen <[email protected]> Tested-by: Mikhail Gavrilov <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
eb58ad14 |
| 30-Jun-2023 |
Xiaogang Chen <[email protected]> |
drm/amdgpu: have bos for PDs/PTS cpu accessible when kfd uses cpu to update vm
When kfd uses cpu to update vm iterates all current PDs/PTs bos, adds AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED flag and km
drm/amdgpu: have bos for PDs/PTS cpu accessible when kfd uses cpu to update vm
When kfd uses cpu to update vm iterates all current PDs/PTs bos, adds AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED flag and kmap them to kernel virtual address space before kfd updates the vm that was created by gfx.
Signed-off-by: Xiaogang Chen <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
e77673d1 |
| 09-Jun-2023 |
Mukul Joshi <[email protected]> |
drm/amdgpu: Update invalid PTE flag setting
Update the invalid PTE flag setting with TF enabled. This is to ensure, in addition to transitioning the retry fault to a no-retry fault, it also causes t
drm/amdgpu: Update invalid PTE flag setting
Update the invalid PTE flag setting with TF enabled. This is to ensure, in addition to transitioning the retry fault to a no-retry fault, it also causes the wavefront to enter the trap handler. With the current setting, the fault only transitions to a no-retry fault. Additionally, have 2 sets of invalid PTE settings, one for TF enabled, the other for TF disabled. The setting with TF disabled, doesn't work with TF enabled.
Signed-off-by: Mukul Joshi <[email protected]> Acked-by: Christian König <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
cbb63ecc |
| 29-May-2023 |
Horatio Zhang <[email protected]> |
drm/amdgpu: fix Null pointer dereference error in amdgpu_device_recover_vram
Use the function of amdgpu_bo_vm_destroy to handle the resource release of shadow bo. During the amdgpu_mes_self_test, sh
drm/amdgpu: fix Null pointer dereference error in amdgpu_device_recover_vram
Use the function of amdgpu_bo_vm_destroy to handle the resource release of shadow bo. During the amdgpu_mes_self_test, shadow bo released, but vmbo->shadow_list was not, which caused a null pointer reference error in amdgpu_device_recover_vram when GPU reset.
Fixes: 6c032c37ac3e ("drm/amdgpu: Fix vram recover doesn't work after whole GPU reset (v2)") Signed-off-by: xinhui pan <[email protected]> Signed-off-by: Horatio Zhang <[email protected]> Acked-by: Feifei Xu <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|