History log of /linux-6.15/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c (Results 1 – 25 of 40)
Revision (<<< Hide revision tags) (Show revision tags >>>) Date Author Comments
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13
# 23b64523 14-Jan-2025 Philip Yang <[email protected]>

drm/amdgpu: Unlocked unmap only clear page table leaves

SVM migration unmap pages from GPU and then update mapping to GPU to
recover page fault. Currently unmap clears the PDE entry for range
length

drm/amdgpu: Unlocked unmap only clear page table leaves

SVM migration unmap pages from GPU and then update mapping to GPU to
recover page fault. Currently unmap clears the PDE entry for range
length >= huge page and free PTB bo, update mapping to alloc new PT bo.
There is race bug that the freed entry bo maybe still on the pt_free
list, reused when updating mapping and then freed, leave invalid PDE
entry and cause GPU page fault.

By setting the update to clear only one PDE entry or clear PTB, to
avoid unmap to free PTE bo. This fixes the race bug and improve the
unmap and map to GPU performance. Update mapping to huge page will
still free the PTB bo.

With this change, the vm->pt_freed list and work is not needed. Add
WARN_ON(unlocked) in amdgpu_vm_pt_free_dfs to catch if unmap to free the
PTB.

Signed-off-by: Philip Yang <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4
# 74ef9527 19-Dec-2024 Yunxiang Li <[email protected]>

drm/amdgpu: track bo memory stats at runtime

Before, every time fdinfo is queried we try to lock all the BOs in the
VM and calculate memory usage from scratch. This works okay if the
fdinfo is rarel

drm/amdgpu: track bo memory stats at runtime

Before, every time fdinfo is queried we try to lock all the BOs in the
VM and calculate memory usage from scratch. This works okay if the
fdinfo is rarely read and the VMs don't have a ton of BOs. If either of
these conditions is not true, we get a massive performance hit.

In this new revision, we track the BOs as they change states. This way
when the fdinfo is queried we only need to take the status lock and copy
out the usage stats with minimal impact to the runtime performance. With
this new approach however, we would no longer be able to track active
buffers.

Signed-off-by: Yunxiang Li <[email protected]>
Reviewed-by: Christian König <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Signed-off-by: Christian König <[email protected]>

show more ...


Revision tags: v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6
# 7181faaa 27-Aug-2024 Christian König <[email protected]>

drm/amdgpu: nuke the VM PD/PT shadow handling

This was only used as workaround for recovering the page tables after
VRAM was lost and is no longer necessary after the function
amdgpu_vm_bo_reset_sta

drm/amdgpu: nuke the VM PD/PT shadow handling

This was only used as workaround for recovering the page tables after
VRAM was lost and is no longer necessary after the function
amdgpu_vm_bo_reset_state_machine() started to do the same.

Compute never used shadows either, so the only proplematic case left is
SVM and that is most likely not recoverable in any way when VRAM is
lost.

Signed-off-by: Christian König <[email protected]>
Acked-by: Lijo Lazar <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.11-rc5
# 4da5a95b 20-Aug-2024 Christian König <[email protected]>

drm/amdgpu: re-work VM syncing

Rework how VM operations synchronize to submissions. Provide an
amdgpu_sync container to the backends instead of an reservation
object and fill in the amdgpu_sync obje

drm/amdgpu: re-work VM syncing

Rework how VM operations synchronize to submissions. Provide an
amdgpu_sync container to the backends instead of an reservation
object and fill in the amdgpu_sync object in the higher layers
of the code.

No intended functional change, just prepares for upcomming changes.

Signed-off-by: Christian König <[email protected]>
Reviewed-by: Friedrich Vock <[email protected]>
Acked-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1
# 511a623f 23-May-2024 Jesse Zhang <[email protected]>

drm/amdgpu: fix dereference null return value for the function amdgpu_vm_pt_parent

The pointer parent may be NULLed by the function amdgpu_vm_pt_parent.
To make the code more robust, check the point

drm/amdgpu: fix dereference null return value for the function amdgpu_vm_pt_parent

The pointer parent may be NULLed by the function amdgpu_vm_pt_parent.
To make the code more robust, check the pointer parent.

Signed-off-by: Jesse Zhang <[email protected]>
Suggested-by: Christian König <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7, v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7, v6.4-rc6, v6.4-rc5, v6.4-rc4, v6.4-rc3, v6.4-rc2, v6.4-rc1, v6.3, v6.3-rc7, v6.3-rc6, v6.3-rc5, v6.3-rc4, v6.3-rc3, v6.3-rc2
# 980a0a94 08-Mar-2023 Hawking Zhang <[email protected]>

drm/amdgpu: support gfx v12 specific pte/pde fields

Add gfx v12 pte/pde support to gmc common helper.

v2: squash in fixes (Alex)

Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: L

drm/amdgpu: support gfx v12 specific pte/pde fields

Add gfx v12 pte/pde support to gmc common helper.

v2: squash in fixes (Alex)

Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Likun Gao <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# a0cf3654 23-May-2024 Jesse Zhang <[email protected]>

drm/amdgpu: fix dereference null return value for the function amdgpu_vm_pt_parent

The pointer parent may be NULLed by the function amdgpu_vm_pt_parent.
To make the code more robust, check the point

drm/amdgpu: fix dereference null return value for the function amdgpu_vm_pt_parent

The pointer parent may be NULLed by the function amdgpu_vm_pt_parent.
To make the code more robust, check the pointer parent.

Signed-off-by: Jesse Zhang <[email protected]>
Suggested-by: Christian König <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# f88a7dd0 21-Mar-2024 Shashank Sharma <[email protected]>

drm/amdgpu: Add a NULL check for freeing root PT

This patch adds a NULL check to fix this crash reported during the
freeing of root PT entry:

BUG: unable to handle page fault for address: ffffc900

drm/amdgpu: Add a NULL check for freeing root PT

This patch adds a NULL check to fix this crash reported during the
freeing of root PT entry:

BUG: unable to handle page fault for address: ffffc9002d637aa0
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
RIP: 0010:amdgpu_vm_pt_free+0x66/0xe0 [amdgpu]
PKRU: 55555554
Call Trace:
<TASK>
amdgpu_vm_pt_free_root+0x60/0xa0 [amdgpu]
amdgpu_vm_fini+0x2cb/0x5d0 [amdgpu]
? amdgpu_ctx_mgr_entity_fini+0x53/0x1c0 [amdgpu]
amdgpu_driver_postclose_kms+0x191/0x2d0 [amdgpu]
drm_file_free.part.0+0x1e5/0x260 [drm]

Cc: Christian König <[email protected]>
Cc: Alex Deucher <[email protected]>
Cc: Felix Kuehling <[email protected]>
Cc: Rajneesh Bhardwaj <[email protected]>
Acked-by: Christian König <[email protected]>
Signed-off-by: Shashank Sharma <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# b6c4f90b 18-Mar-2024 Shashank Sharma <[email protected]>

drm/amdgpu: sync page table freeing with tlb flush

The idea behind this patch is to delay the freeing of PT entry objects
until the TLB flush is done.

This patch:
- Adds a tlb_flush_waitlist in amd

drm/amdgpu: sync page table freeing with tlb flush

The idea behind this patch is to delay the freeing of PT entry objects
until the TLB flush is done.

This patch:
- Adds a tlb_flush_waitlist in amdgpu_vm_update_params which will keep the
objects that need to be freed after tlb_flush.
- Adds PT entries in this list in amdgpu_vm_ptes_update after finding
the PT entry.
- Changes functionality of amdgpu_vm_pt_free_dfs from (df_search + free)
to simply freeing of the BOs, also renames it to
amdgpu_vm_pt_free_list to reflect this same.
- Exports function amdgpu_vm_pt_free_list to be called directly.
- Calls amdgpu_vm_pt_free_list directly from amdgpu_vm_update_range.

V2: rebase
V4: Addressed review comments from Christian
- add only locked PTEs entries in TLB flush waitlist.
- do not create a separate function for list flush.
- do not create a new lock for TLB flush.
- there is no need to wait on tlb_flush_fence exclusively.

V5: Addressed review comments from Christian
- change the amdgpu_vm_pt_free_dfs's functionality to simple freeing
of the objects and rename it.
- add all the PTE objects in params->tlb_flush_waitlist
- let amdgpu_vm_pt_free_root handle the freeing of BOs independently
- call amdgpu_vm_pt_free_list directly

V6: Rebase
V7: Rebase
V8: Added a NULL check to fix this backtrace issue:
[ 415.351447] BUG: kernel NULL pointer dereference, address: 0000000000000008
[ 415.359245] #PF: supervisor write access in kernel mode
[ 415.365081] #PF: error_code(0x0002) - not-present page
[ 415.370817] PGD 101259067 P4D 101259067 PUD 10125a067 PMD 0
[ 415.377140] Oops: 0002 [#1] PREEMPT SMP NOPTI
[ 415.382004] CPU: 0 PID: 25481 Comm: test_with_MPI.e Tainted: G OE 5.18.2-mi300-build-140423-ubuntu-22.04+ #24
[ 415.394437] Hardware name: AMD Corporation Sh51p/Sh51p, BIOS RMO1001AS 02/21/2024
[ 415.402797] RIP: 0010:amdgpu_vm_ptes_update+0x6fd/0xa10 [amdgpu]
[ 415.409648] Code: 4c 89 ff 4d 8d 66 30 e8 f1 ed ff ff 48 85 db 74 42 48 39 5d a0 74 40 48 8b 53 20 48 8b 4b 18 48 8d 43 18 48 8d 75 b0 4c 89 ff <48
> 89 51 08 48 89 0a 49 8b 56 30 48 89 42 08 48 89 53 18 4c 89 63
[ 415.430621] RSP: 0018:ffffc9000401f990 EFLAGS: 00010287
[ 415.436456] RAX: ffff888147bb82f0 RBX: ffff888147bb82d8 RCX: 0000000000000000
[ 415.444426] RDX: 0000000000000000 RSI: ffffc9000401fa30 RDI: ffff888161f80000
[ 415.452397] RBP: ffffc9000401fa80 R08: 0000000000000000 R09: ffffc9000401fa00
[ 415.460368] R10: 00000007f0cc0000 R11: 00000007f0c85000 R12: ffffc9000401fb20
[ 415.468340] R13: 00000007f0d00000 R14: ffffc9000401faf0 R15: ffff888161f80000
[ 415.476312] FS: 00007f132ff89840(0000) GS:ffff889f87c00000(0000) knlGS:0000000000000000
[ 415.485350] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 415.491767] CR2: 0000000000000008 CR3: 0000000161d46003 CR4: 0000000000770ef0
[ 415.499738] PKRU: 55555554
[ 415.502750] Call Trace:
[ 415.505482] <TASK>
[ 415.507825] amdgpu_vm_update_range+0x32a/0x880 [amdgpu]
[ 415.513869] amdgpu_vm_clear_freed+0x117/0x250 [amdgpu]
[ 415.519814] amdgpu_amdkfd_gpuvm_unmap_memory_from_gpu+0x18c/0x250 [amdgpu]
[ 415.527729] kfd_ioctl_unmap_memory_from_gpu+0xed/0x340 [amdgpu]
[ 415.534551] kfd_ioctl+0x3b6/0x510 [amdgpu]

V9: Addressed review comments from Christian
- No NULL check reqd for root PT freeing
- Free PT list regardless of needs_flush
- Move adding BOs in list in a separate function

V10: Added Christian's RB
V11: squash in list fix

Cc: Christian König <[email protected]>
Cc: Alex Deucher <[email protected]>
Cc: Felix Kuehling <[email protected]>
Cc: Rajneesh Bhardwaj <[email protected]>
Acked-by: Felix Kuehling <[email protected]>
Acked-by: Rajneesh Bhardwaj <[email protected]>
Reviewed-by: Christian König <[email protected]>
Tested-by: Rajneesh Bhardwaj <[email protected]>
Signed-off-by: Shashank Sharma <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# d8a3f0a0 18-Mar-2024 Christian Koenig <[email protected]>

drm/amdgpu: implement TLB flush fence

The problem is that when (for example) 4k pages are replaced
with a single 2M page we need to wait for change to be flushed
out by invalidating the TLB before t

drm/amdgpu: implement TLB flush fence

The problem is that when (for example) 4k pages are replaced
with a single 2M page we need to wait for change to be flushed
out by invalidating the TLB before the PT can be freed.

Solve this by moving the TLB flush into a DMA-fence object which
can be used to delay the freeing of the PT BOs until it is signaled.

V2: (Shashank)
- rebase
- set dma_fence_error only in case of error
- add tlb_flush fence only when PT/PD BO is locked (Felix)
- use vm->pasid when f is NULL (Mukul)

V4: - add a wait for (f->dependency) in tlb_fence_work (Christian)
- move the misplaced fence_create call to the end (Philip)

V5: - free the f->dependency properly

V6: (Shashank)
- light code movement, moved all the clean-up in previous patch
- introduce params.needs_flush and its usage in this patch
- rebase without TLB HW sequence patch

V7:
- Keep the vm->last_update_fence and tlb_cb code until
we can fix the HW sequencing (Christian)
- Move all the tlb_fence related code in a separate function so that
its easier to read and review

V9: Addressed review comments from Christian
- start PT update only when we have callback memory allocated

V10:
- handle device unlock in OOM case (Christian, Mukul)
- added Christian's R-B

Cc: Christian Koenig <[email protected]>
Cc: Felix Kuehling <[email protected]>
Cc: Rajneesh Bhardwaj <[email protected]>
Cc: Alex Deucher <[email protected]>
Acked-by: Felix Kuehling <[email protected]>
Acked-by: Rajneesh Bhardwaj <[email protected]>
Tested-by: Rajneesh Bhardwaj <[email protected]>
Reviewed-by: Shashank Sharma <[email protected]>
Reviewed-by: Christian Koenig <[email protected]>
Signed-off-by: Christian Koenig <[email protected]>
Signed-off-by: Shashank Sharma <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# bb8863cc 05-Mar-2024 Jesse Zhang <[email protected]>

drm/amdgpu: remove unused code

Remove the unused function - amdgpu_vm_pt_is_root_clean
and remove the impossible condition

v1: entries == 0 is not possible any more,
so this condition could pro

drm/amdgpu: remove unused code

Remove the unused function - amdgpu_vm_pt_is_root_clean
and remove the impossible condition

v1: entries == 0 is not possible any more,
so this condition could probably be removed (Felix)

Signed-off-by: Jesse Zhang <[email protected]>
Suggested-by:Felix Kuehling <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# b8f67b9d 18-Jan-2024 Shashank Sharma <[email protected]>

drm/amdgpu: change vm->task_info handling

This patch changes the handling and lifecycle of vm->task_info object.
The major changes are:
- vm->task_info is a dynamically allocated ptr now, and its ua

drm/amdgpu: change vm->task_info handling

This patch changes the handling and lifecycle of vm->task_info object.
The major changes are:
- vm->task_info is a dynamically allocated ptr now, and its uasge is
reference counted.
- introducing two new helper funcs for task_info lifecycle management
- amdgpu_vm_get_task_info: reference counts up task_info before
returning this info
- amdgpu_vm_put_task_info: reference counts down task_info
- last put to task_info() frees task_info from the vm.

This patch also does logistical changes required for existing usage
of vm->task_info.

V2: Do not block all the prints when task_info not found (Felix)

V3: Fixed review comments from Felix
- Fix wrong indentation
- No debug message for -ENOMEM
- Add NULL check for task_info
- Do not duplicate the debug messages (ti vs no ti)
- Get first reference of task_info in vm_init(), put last
in vm_fini()

V4: Fixed review comments from Felix
- fix double reference increment in create_task_info
- change amdgpu_vm_get_task_info_pasid
- additional changes in amdgpu_gem.c while porting

Cc: Christian Koenig <[email protected]>
Cc: Alex Deucher <[email protected]>
Cc: Felix Kuehling <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Shashank Sharma <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# feb13f52 29-Feb-2024 Jesse Zhang <[email protected]>

Revert "drm/amdgpu: remove vm sanity check from amdgpu_vm_make_compute" for Raven

fix the issue:
"amdgpu: Failed to create process VM object".

[Why]when amdgpu initialized, seq64 do mampping and up

Revert "drm/amdgpu: remove vm sanity check from amdgpu_vm_make_compute" for Raven

fix the issue:
"amdgpu: Failed to create process VM object".

[Why]when amdgpu initialized, seq64 do mampping and update bo mapping in vm page table.
But when clifo run. It also initializes a vm for a process device through the function kfd_process_device_init_vm and ensure the root PD is clean through the function amdgpu_vm_pt_is_root_clean.
So they have a conflict, and clinfo always failed.

v1:
- remove all the pte_supports_ats stuff from the amdgpu_vm code (Felix)

Signed-off-by: Jesse Zhang <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# ceb9a321 08-Dec-2023 Christian König <[email protected]>

drm/amdgpu: fix tear down order in amdgpu_vm_pt_free

When freeing PD/PT with shadows it can happen that the shadow
destruction races with detaching the PD/PT from the VM causing a NULL
pointer deref

drm/amdgpu: fix tear down order in amdgpu_vm_pt_free

When freeing PD/PT with shadows it can happen that the shadow
destruction races with detaching the PD/PT from the VM causing a NULL
pointer dereference in the invalidation code.

Fix this by detaching the the PD/PT from the VM first and then
freeing the shadow instead.

Signed-off-by: Christian König <[email protected]>
Fixes: https://gitlab.freedesktop.org/drm/amd/-/issues/2867
Cc: <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# 683b8c7e 08-Dec-2023 Christian König <[email protected]>

drm/amdgpu: fix tear down order in amdgpu_vm_pt_free

When freeing PD/PT with shadows it can happen that the shadow
destruction races with detaching the PD/PT from the VM causing a NULL
pointer deref

drm/amdgpu: fix tear down order in amdgpu_vm_pt_free

When freeing PD/PT with shadows it can happen that the shadow
destruction races with detaching the PD/PT from the VM causing a NULL
pointer dereference in the invalidation code.

Fix this by detaching the the PD/PT from the VM first and then
freeing the shadow instead.

Signed-off-by: Christian König <[email protected]>
Fixes: https://gitlab.freedesktop.org/drm/amd/-/issues/2867
Cc: <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# 142262a1 12-Oct-2023 David Francis <[email protected]>

drm/amdgpu: Add EXT_COHERENT support for APU and NUMA systems

On gfx943 APU, EXT_COHERENT should give MTYPE_CC for local and
MTYPE_UC for nonlocal memory.

On NUMA systems, local memory gets the loc

drm/amdgpu: Add EXT_COHERENT support for APU and NUMA systems

On gfx943 APU, EXT_COHERENT should give MTYPE_CC for local and
MTYPE_UC for nonlocal memory.

On NUMA systems, local memory gets the local mtype, set by an
override callback. If EXT_COHERENT is set, memory will be set as
MTYPE_UC by default, with local memory MTYPE_CC.

Add an option in the override function for this case, and
add a check to ensure it is not used on UNCACHED memory.

V2: Combined APU and NUMA code into one patch
V3: Fixed a potential nullptr in amdgpu_vm_bo_update

Signed-off-by: David Francis <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# fdac8909 28-Sep-2023 Philip Yang <[email protected]>

drm/amdgpu: ratelimited override pte flags messages

Use ratelimited version of dev_dbg to avoid flooding dmesg log. No
functional change.

Signed-off-by: Philip Yang <[email protected]>
Reviewed-b

drm/amdgpu: ratelimited override pte flags messages

Use ratelimited version of dev_dbg to avoid flooding dmesg log. No
functional change.

Signed-off-by: Philip Yang <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# f135b0fc 20-Jul-2023 Yang Li <[email protected]>

drm/amdgpu: Fix one kernel-doc comment

Use colon to separate parameter name from their specific meaning.
silence the warning:

drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c:793: warning: Function parame

drm/amdgpu: Fix one kernel-doc comment

Use colon to separate parameter name from their specific meaning.
silence the warning:

drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c:793: warning: Function parameter or member 'adev' not described in 'amdgpu_vm_pte_update_noretry_flags'

Reviewed-by: Randy Dunlap <[email protected]>
Signed-off-by: Yang Li <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# 8e781271 13-Jul-2023 Guchun Chen <[email protected]>

drm/amdgpu/vm: use the same xcp_id from root PD

Other PDs/PTs allocation should just use the same xcp_id as that
stored in root PD.

Suggested-by: Christian König <[email protected]>
Signed-o

drm/amdgpu/vm: use the same xcp_id from root PD

Other PDs/PTs allocation should just use the same xcp_id as that
stored in root PD.

Suggested-by: Christian König <[email protected]>
Signed-off-by: Guchun Chen <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# 8ecee4cb 13-Jul-2023 Guchun Chen <[email protected]>

drm/amdgpu: fix slab-out-of-bounds issue in amdgpu_vm_pt_create

Recent code set xcp_id stored from file private data when opening
device to amdgpu bo for accounting memory usage etc, but not all
VMs

drm/amdgpu: fix slab-out-of-bounds issue in amdgpu_vm_pt_create

Recent code set xcp_id stored from file private data when opening
device to amdgpu bo for accounting memory usage etc, but not all
VMs are attached to this fpriv structure like the vm cases in
amdgpu_mes_self_test, otherwise, KASAN will complain below out
of bound access. And more importantly, VM code should not touch
fpriv structure, so drop fpriv code handling from amdgpu_vm_pt.

[ 77.292314] BUG: KASAN: slab-out-of-bounds in amdgpu_vm_pt_create+0x17e/0x4b0 [amdgpu]
[ 77.293845] Read of size 4 at addr ffff888102c48a48 by task modprobe/1069
[ 77.294146] Call Trace:
[ 77.294178] <TASK>
[ 77.294208] dump_stack_lvl+0x49/0x63
[ 77.294260] print_report+0x16f/0x4a6
[ 77.294307] ? amdgpu_vm_pt_create+0x17e/0x4b0 [amdgpu]
[ 77.295979] ? kasan_complete_mode_report_info+0x3c/0x200
[ 77.296057] ? amdgpu_vm_pt_create+0x17e/0x4b0 [amdgpu]
[ 77.297556] kasan_report+0xb4/0x130
[ 77.297609] ? amdgpu_vm_pt_create+0x17e/0x4b0 [amdgpu]
[ 77.299202] __asan_load4+0x6f/0x90
[ 77.299272] amdgpu_vm_pt_create+0x17e/0x4b0 [amdgpu]
[ 77.300796] ? amdgpu_init+0x6e/0x1000 [amdgpu]
[ 77.302222] ? amdgpu_vm_pt_clear+0x750/0x750 [amdgpu]
[ 77.303721] ? preempt_count_sub+0x18/0xc0
[ 77.303786] amdgpu_vm_init+0x39e/0x870 [amdgpu]
[ 77.305186] ? amdgpu_vm_wait_idle+0x90/0x90 [amdgpu]
[ 77.306683] ? kasan_set_track+0x25/0x30
[ 77.306737] ? kasan_save_alloc_info+0x1b/0x30
[ 77.306795] ? __kasan_kmalloc+0x87/0xa0
[ 77.306852] amdgpu_mes_self_test+0x169/0x620 [amdgpu]

v2: without specifying xcp partition for PD/PT bo, the xcp id is -1.

Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2686
Fixes: 3ebfd221c1a8 ("drm/amdkfd: Store xcp partition id to amdgpu bo")
Signed-off-by: Guchun Chen <[email protected]>
Tested-by: Mikhail Gavrilov <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# e379b5e7 13-Jul-2023 Guchun Chen <[email protected]>

drm/amdgpu/vm: use the same xcp_id from root PD

Other PDs/PTs allocation should just use the same xcp_id as that
stored in root PD.

Suggested-by: Christian König <[email protected]>
Signed-o

drm/amdgpu/vm: use the same xcp_id from root PD

Other PDs/PTs allocation should just use the same xcp_id as that
stored in root PD.

Suggested-by: Christian König <[email protected]>
Signed-off-by: Guchun Chen <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# 5003ca63 13-Jul-2023 Guchun Chen <[email protected]>

drm/amdgpu: fix slab-out-of-bounds issue in amdgpu_vm_pt_create

Recent code set xcp_id stored from file private data when opening
device to amdgpu bo for accounting memory usage etc, but not all
VMs

drm/amdgpu: fix slab-out-of-bounds issue in amdgpu_vm_pt_create

Recent code set xcp_id stored from file private data when opening
device to amdgpu bo for accounting memory usage etc, but not all
VMs are attached to this fpriv structure like the vm cases in
amdgpu_mes_self_test, otherwise, KASAN will complain below out
of bound access. And more importantly, VM code should not touch
fpriv structure, so drop fpriv code handling from amdgpu_vm_pt.

[ 77.292314] BUG: KASAN: slab-out-of-bounds in amdgpu_vm_pt_create+0x17e/0x4b0 [amdgpu]
[ 77.293845] Read of size 4 at addr ffff888102c48a48 by task modprobe/1069
[ 77.294146] Call Trace:
[ 77.294178] <TASK>
[ 77.294208] dump_stack_lvl+0x49/0x63
[ 77.294260] print_report+0x16f/0x4a6
[ 77.294307] ? amdgpu_vm_pt_create+0x17e/0x4b0 [amdgpu]
[ 77.295979] ? kasan_complete_mode_report_info+0x3c/0x200
[ 77.296057] ? amdgpu_vm_pt_create+0x17e/0x4b0 [amdgpu]
[ 77.297556] kasan_report+0xb4/0x130
[ 77.297609] ? amdgpu_vm_pt_create+0x17e/0x4b0 [amdgpu]
[ 77.299202] __asan_load4+0x6f/0x90
[ 77.299272] amdgpu_vm_pt_create+0x17e/0x4b0 [amdgpu]
[ 77.300796] ? amdgpu_init+0x6e/0x1000 [amdgpu]
[ 77.302222] ? amdgpu_vm_pt_clear+0x750/0x750 [amdgpu]
[ 77.303721] ? preempt_count_sub+0x18/0xc0
[ 77.303786] amdgpu_vm_init+0x39e/0x870 [amdgpu]
[ 77.305186] ? amdgpu_vm_wait_idle+0x90/0x90 [amdgpu]
[ 77.306683] ? kasan_set_track+0x25/0x30
[ 77.306737] ? kasan_save_alloc_info+0x1b/0x30
[ 77.306795] ? __kasan_kmalloc+0x87/0xa0
[ 77.306852] amdgpu_mes_self_test+0x169/0x620 [amdgpu]

v2: without specifying xcp partition for PD/PT bo, the xcp id is -1.

Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2686
Fixes: 3ebfd221c1a8 ("drm/amdkfd: Store xcp partition id to amdgpu bo")
Signed-off-by: Guchun Chen <[email protected]>
Tested-by: Mikhail Gavrilov <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# eb58ad14 30-Jun-2023 Xiaogang Chen <[email protected]>

drm/amdgpu: have bos for PDs/PTS cpu accessible when kfd uses cpu to update vm

When kfd uses cpu to update vm iterates all current PDs/PTs bos, adds
AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED flag and km

drm/amdgpu: have bos for PDs/PTS cpu accessible when kfd uses cpu to update vm

When kfd uses cpu to update vm iterates all current PDs/PTs bos, adds
AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED flag and kmap them to kernel virtual
address space before kfd updates the vm that was created by gfx.

Signed-off-by: Xiaogang Chen <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# e77673d1 09-Jun-2023 Mukul Joshi <[email protected]>

drm/amdgpu: Update invalid PTE flag setting

Update the invalid PTE flag setting with TF enabled.
This is to ensure, in addition to transitioning the
retry fault to a no-retry fault, it also causes t

drm/amdgpu: Update invalid PTE flag setting

Update the invalid PTE flag setting with TF enabled.
This is to ensure, in addition to transitioning the
retry fault to a no-retry fault, it also causes the
wavefront to enter the trap handler. With the current
setting, the fault only transitions to a no-retry fault.
Additionally, have 2 sets of invalid PTE settings, one for
TF enabled, the other for TF disabled. The setting with
TF disabled, doesn't work with TF enabled.

Signed-off-by: Mukul Joshi <[email protected]>
Acked-by: Christian König <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# cbb63ecc 29-May-2023 Horatio Zhang <[email protected]>

drm/amdgpu: fix Null pointer dereference error in amdgpu_device_recover_vram

Use the function of amdgpu_bo_vm_destroy to handle the resource release
of shadow bo. During the amdgpu_mes_self_test, sh

drm/amdgpu: fix Null pointer dereference error in amdgpu_device_recover_vram

Use the function of amdgpu_bo_vm_destroy to handle the resource release
of shadow bo. During the amdgpu_mes_self_test, shadow bo released, but
vmbo->shadow_list was not, which caused a null pointer reference error
in amdgpu_device_recover_vram when GPU reset.

Fixes: 6c032c37ac3e ("drm/amdgpu: Fix vram recover doesn't work after whole GPU reset (v2)")
Signed-off-by: xinhui pan <[email protected]>
Signed-off-by: Horatio Zhang <[email protected]>
Acked-by: Feifei Xu <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


12