Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6

# 1092a4ea | 03-Mar-2025 | Dr. David Alan Gilbert <[email protected]>

drm/amdgpu: Remove unused pqm_get_kernel_queue

pqm_get_kernel_queue() has been unused since 2022's commit 5bdd3eb25354 ("drm/amdkfd: Remove unused old debugger implementation").

Remove it.

Signed-off-by: Dr. David Alan Gilbert <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

Revision tags: v6.14-rc5, v6.14-rc4

# 7919b4ca | 20-Feb-2025 | Philip Yang <[email protected]>

drm/amdkfd: Fix pqm_destroy_queue race with GPU reset

If the GPU is in reset, destroy_queue returns -EIO; pqm_destroy_queue should still delete the queue from process_queue_list and free its resources.

Signed-off-by: Philip Yang <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
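
As an illustration of the teardown flow this fix describes, here is a minimal self-contained userspace model; every type and function name below is a stand-in, not the actual kfd code.

#include <errno.h>
#include <stdlib.h>

struct queue_node { int id; struct queue_node *next; };

/* Stand-in for the hardware destroy hook; fails with -EIO during reset. */
static int destroy_queue_hw(int gpu_in_reset)
{
    return gpu_in_reset ? -EIO : 0;
}

static int pqm_destroy_queue_model(struct queue_node **list, int id,
                                   int gpu_in_reset)
{
    int ret = destroy_queue_hw(gpu_in_reset);

    if (ret && ret != -EIO)
        return ret;     /* genuine failure: keep the queue */

    /* On success OR -EIO (GPU in reset): unlink the queue from the
     * process queue list and free it, so the entry is not leaked. */
    for (struct queue_node **pp = list; *pp; pp = &(*pp)->next) {
        if ((*pp)->id == id) {
            struct queue_node *gone = *pp;
            *pp = gone->next;
            free(gone);
            break;
        }
    }
    return ret;
}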

# fddc4502 | 24-Feb-2025 | Srinivasan Shanmugam <[email protected]>

drm/amdkfd: Fix Circular Locking Dependency in 'svm_range_cpu_invalidate_pagetables'
This commit addresses a circular locking dependency in the svm_range_cpu_invalidate_pagetables function. The function previously held a lock while determining whether to perform an unmap or eviction operation, which could lead to deadlocks.
Fixes the below:
[ 223.418794] ======================================================
[ 223.418820] WARNING: possible circular locking dependency detected
[ 223.418845] 6.12.0-amdstaging-drm-next-lol-050225 #14 Tainted: G U OE
[ 223.418869] ------------------------------------------------------
[ 223.418889] kfdtest/3939 is trying to acquire lock:
[ 223.418906] ffff8957552eae38 (&dqm->lock_hidden){+.+.}-{3:3}, at: evict_process_queues_cpsch+0x43/0x210 [amdgpu]
[ 223.419302] but task is already holding lock:
[ 223.419303] ffff8957556b83b0 (&prange->lock){+.+.}-{3:3}, at: svm_range_cpu_invalidate_pagetables+0x9d/0x850 [amdgpu]
[ 223.419447] Console: switching to colour dummy device 80x25
[ 223.419477] [IGT] amd_basic: executing
[ 223.419599] which lock already depends on the new lock.
[ 223.419611] the existing dependency chain (in reverse order) is:
[ 223.419621] -> #2 (&prange->lock){+.+.}-{3:3}:
[ 223.419636]        __mutex_lock+0x85/0xe20
[ 223.419647]        mutex_lock_nested+0x1b/0x30
[ 223.419656]        svm_range_validate_and_map+0x2f1/0x15b0 [amdgpu]
[ 223.419954]        svm_range_set_attr+0xe8c/0x1710 [amdgpu]
[ 223.420236]        svm_ioctl+0x46/0x50 [amdgpu]
[ 223.420503]        kfd_ioctl_svm+0x50/0x90 [amdgpu]
[ 223.420763]        kfd_ioctl+0x409/0x6d0 [amdgpu]
[ 223.421024]        __x64_sys_ioctl+0x95/0xd0
[ 223.421036]        x64_sys_call+0x1205/0x20d0
[ 223.421047]        do_syscall_64+0x87/0x140
[ 223.421056]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 223.421068] -> #1 (reservation_ww_class_mutex){+.+.}-{3:3}:
[ 223.421084]        __ww_mutex_lock.constprop.0+0xab/0x1560
[ 223.421095]        ww_mutex_lock+0x2b/0x90
[ 223.421103]        amdgpu_amdkfd_alloc_gtt_mem+0xcc/0x2b0 [amdgpu]
[ 223.421361]        add_queue_mes+0x3bc/0x440 [amdgpu]
[ 223.421623]        unhalt_cpsch+0x1ae/0x240 [amdgpu]
[ 223.421888]        kgd2kfd_start_sched+0x5e/0xd0 [amdgpu]
[ 223.422148]        amdgpu_amdkfd_start_sched+0x3d/0x50 [amdgpu]
[ 223.422414]        amdgpu_gfx_enforce_isolation_handler+0x132/0x270 [amdgpu]
[ 223.422662]        process_one_work+0x21e/0x680
[ 223.422673]        worker_thread+0x190/0x330
[ 223.422682]        kthread+0xe7/0x120
[ 223.422690]        ret_from_fork+0x3c/0x60
[ 223.422699]        ret_from_fork_asm+0x1a/0x30
[ 223.422708] -> #0 (&dqm->lock_hidden){+.+.}-{3:3}:
[ 223.422723]        __lock_acquire+0x16f4/0x2810
[ 223.422734]        lock_acquire+0xd1/0x300
[ 223.422742]        __mutex_lock+0x85/0xe20
[ 223.422751]        mutex_lock_nested+0x1b/0x30
[ 223.422760]        evict_process_queues_cpsch+0x43/0x210 [amdgpu]
[ 223.423025]        kfd_process_evict_queues+0x8a/0x1d0 [amdgpu]
[ 223.423285]        kgd2kfd_quiesce_mm+0x43/0x90 [amdgpu]
[ 223.423540]        svm_range_cpu_invalidate_pagetables+0x4a7/0x850 [amdgpu]
[ 223.423807]        __mmu_notifier_invalidate_range_start+0x1f5/0x250
[ 223.423819]        copy_page_range+0x1e94/0x1ea0
[ 223.423829]        copy_process+0x172f/0x2ad0
[ 223.423839]        kernel_clone+0x9c/0x3f0
[ 223.423847]        __do_sys_clone+0x66/0x90
[ 223.423856]        __x64_sys_clone+0x25/0x30
[ 223.423864]        x64_sys_call+0x1d7c/0x20d0
[ 223.423872]        do_syscall_64+0x87/0x140
[ 223.423880]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 223.423891] other info that might help us debug this:
[ 223.423903] Chain exists of: &dqm->lock_hidden --> reservation_ww_class_mutex --> &prange->lock
[ 223.423926] Possible unsafe locking scenario:
[ 223.423935]        CPU0                    CPU1
[ 223.423942]        ----                    ----
[ 223.423949]   lock(&prange->lock);
[ 223.423958]                                lock(reservation_ww_class_mutex);
[ 223.423970]                                lock(&prange->lock);
[ 223.423981]   lock(&dqm->lock_hidden);
[ 223.423990] *** DEADLOCK ***
[ 223.423999] 5 locks held by kfdtest/3939:
[ 223.424006] #0: ffffffffb82b4fc0 (dup_mmap_sem){.+.+}-{0:0}, at: copy_process+0x1387/0x2ad0
[ 223.424026] #1: ffff89575eda81b0 (&mm->mmap_lock){++++}-{3:3}, at: copy_process+0x13a8/0x2ad0
[ 223.424046] #2: ffff89575edaf3b0 (&mm->mmap_lock/1){+.+.}-{3:3}, at: copy_process+0x13e4/0x2ad0
[ 223.424066] #3: ffffffffb82e76e0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: copy_page_range+0x1cea/0x1ea0
[ 223.424088] #4: ffff8957556b83b0 (&prange->lock){+.+.}-{3:3}, at: svm_range_cpu_invalidate_pagetables+0x9d/0x850 [amdgpu]
[ 223.424365] stack backtrace:
[ 223.424374] CPU: 0 UID: 0 PID: 3939 Comm: kfdtest Tainted: G U OE 6.12.0-amdstaging-drm-next-lol-050225 #14
[ 223.424392] Tainted: [U]=USER, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[ 223.424401] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS PRO WIFI/X570 AORUS PRO WIFI, BIOS F36a 02/16/2022
[ 223.424416] Call Trace:
[ 223.424423]  <TASK>
[ 223.424430]  dump_stack_lvl+0x9b/0xf0
[ 223.424441]  dump_stack+0x10/0x20
[ 223.424449]  print_circular_bug+0x275/0x350
[ 223.424460]  check_noncircular+0x157/0x170
[ 223.424469]  ? __bfs+0xfd/0x2c0
[ 223.424481]  __lock_acquire+0x16f4/0x2810
[ 223.424490]  ? srso_return_thunk+0x5/0x5f
[ 223.424505]  lock_acquire+0xd1/0x300
[ 223.424514]  ? evict_process_queues_cpsch+0x43/0x210 [amdgpu]
[ 223.424783]  __mutex_lock+0x85/0xe20
[ 223.424792]  ? evict_process_queues_cpsch+0x43/0x210 [amdgpu]
[ 223.425058]  ? srso_return_thunk+0x5/0x5f
[ 223.425067]  ? mark_held_locks+0x54/0x90
[ 223.425076]  ? evict_process_queues_cpsch+0x43/0x210 [amdgpu]
[ 223.425339]  ? srso_return_thunk+0x5/0x5f
[ 223.425350]  mutex_lock_nested+0x1b/0x30
[ 223.425358]  ? mutex_lock_nested+0x1b/0x30
[ 223.425367]  evict_process_queues_cpsch+0x43/0x210 [amdgpu]
[ 223.425631]  kfd_process_evict_queues+0x8a/0x1d0 [amdgpu]
[ 223.425893]  kgd2kfd_quiesce_mm+0x43/0x90 [amdgpu]
[ 223.426156]  svm_range_cpu_invalidate_pagetables+0x4a7/0x850 [amdgpu]
[ 223.426423]  ? srso_return_thunk+0x5/0x5f
[ 223.426436]  __mmu_notifier_invalidate_range_start+0x1f5/0x250
[ 223.426450]  copy_page_range+0x1e94/0x1ea0
[ 223.426461]  ? srso_return_thunk+0x5/0x5f
[ 223.426474]  ? srso_return_thunk+0x5/0x5f
[ 223.426484]  ? lock_acquire+0xd1/0x300
[ 223.426494]  ? copy_process+0x1718/0x2ad0
[ 223.426502]  ? srso_return_thunk+0x5/0x5f
[ 223.426510]  ? sched_clock_noinstr+0x9/0x10
[ 223.426519]  ? local_clock_noinstr+0xe/0xc0
[ 223.426528]  ? copy_process+0x1718/0x2ad0
[ 223.426537]  ? srso_return_thunk+0x5/0x5f
[ 223.426550]  copy_process+0x172f/0x2ad0
[ 223.426569]  kernel_clone+0x9c/0x3f0
[ 223.426577]  ? __schedule+0x4c9/0x1b00
[ 223.426586]  ? srso_return_thunk+0x5/0x5f
[ 223.426594]  ? sched_clock_noinstr+0x9/0x10
[ 223.426602]  ? srso_return_thunk+0x5/0x5f
[ 223.426610]  ? local_clock_noinstr+0xe/0xc0
[ 223.426619]  ? schedule+0x107/0x1a0
[ 223.426629]  __do_sys_clone+0x66/0x90
[ 223.426643]  __x64_sys_clone+0x25/0x30
[ 223.426652]  x64_sys_call+0x1d7c/0x20d0
[ 223.426661]  do_syscall_64+0x87/0x140
[ 223.426671]  ? srso_return_thunk+0x5/0x5f
[ 223.426679]  ? common_nsleep+0x44/0x50
[ 223.426690]  ? srso_return_thunk+0x5/0x5f
[ 223.426698]  ? trace_hardirqs_off+0x52/0xd0
[ 223.426709]  ? srso_return_thunk+0x5/0x5f
[ 223.426717]  ? syscall_exit_to_user_mode+0xcc/0x200
[ 223.426727]  ? srso_return_thunk+0x5/0x5f
[ 223.426736]  ? do_syscall_64+0x93/0x140
[ 223.426748]  ? srso_return_thunk+0x5/0x5f
[ 223.426756]  ? up_write+0x1c/0x1e0
[ 223.426765]  ? srso_return_thunk+0x5/0x5f
[ 223.426775]  ? srso_return_thunk+0x5/0x5f
[ 223.426783]  ? trace_hardirqs_off+0x52/0xd0
[ 223.426792]  ? srso_return_thunk+0x5/0x5f
[ 223.426800]  ? syscall_exit_to_user_mode+0xcc/0x200
[ 223.426810]  ? srso_return_thunk+0x5/0x5f
[ 223.426818]  ? do_syscall_64+0x93/0x140
[ 223.426826]  ? syscall_exit_to_user_mode+0xcc/0x200
[ 223.426836]  ? srso_return_thunk+0x5/0x5f
[ 223.426844]  ? do_syscall_64+0x93/0x140
[ 223.426853]  ? srso_return_thunk+0x5/0x5f
[ 223.426861]  ? irqentry_exit+0x6b/0x90
[ 223.426869]  ? srso_return_thunk+0x5/0x5f
[ 223.426877]  ? exc_page_fault+0xa7/0x2c0
[ 223.426888]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 223.426898] RIP: 0033:0x7f46758eab57
[ 223.426906] Code: ba 04 00 f3 0f 1e fa 64 48 8b 04 25 10 00 00 00 45 31 c0 31 d2 31 f6 bf 11 00 20 01 4c 8d 90 d0 02 00 00 b8 38 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 41 89 c0 85 c0 75 2c 64 48 8b 04 25 10 00
[ 223.426930] RSP: 002b:00007fff5c3e5188 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
[ 223.426943] RAX: ffffffffffffffda RBX: 00007f4675f8c040 RCX: 00007f46758eab57
[ 223.426954] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
[ 223.426965] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[ 223.426975] R10: 00007f4675e81a50 R11: 0000000000000246 R12: 0000000000000001
[ 223.426986] R13: 00007fff5c3e5470 R14: 00007fff5c3e53e0 R15: 00007fff5c3e5410
[ 223.427004]  </TASK>
v2: To resolve this issue, the allocation of the process context buffer (`proc_ctx_bo`) has been moved from the `add_queue_mes` function to the `pqm_create_queue` function. This change ensures that the buffer is allocated only when the first queue for a process is created and only if the Micro Engine Scheduler (MES) is enabled. (Felix)
v3: Fix typo s/Memory Execution Scheduler (MES)/Micro Engine Scheduler in commit message. (Lijo)
Fixes: 438b39ac74e2 ("drm/amdkfd: pause autosuspend when creating pdd")
Cc: Jesse Zhang <[email protected]>
Cc: Yunxiang Li <[email protected]>
Cc: Philip Yang <[email protected]>
Cc: Alex Sierra <[email protected]>
Cc: Felix Kuehling <[email protected]>
Cc: Christian König <[email protected]>
Cc: Alex Deucher <[email protected]>
Signed-off-by: Srinivasan Shanmugam <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
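
As a rough illustration of the v2 reordering, a self-contained userspace sketch follows; the types, sizes, and lock names are stand-ins for the real driver objects (dqm->lock_hidden, proc_ctx_bo), not actual amdgpu code.

#include <pthread.h>
#include <stdlib.h>

struct pdd_model {
    void *proc_ctx_bo;            /* per-process MES context buffer */
    int nqueues;
    pthread_mutex_t dqm_lock;     /* models dqm->lock_hidden; caller must
                                   * initialize with pthread_mutex_init */
};

static int create_queue_model(struct pdd_model *p, int mes_enabled)
{
    /* v2 approach: allocate the process context buffer when the first
     * queue is created, before taking the dqm lock, so the allocator's
     * own locking can never nest inside dqm_lock and invert the order. */
    if (mes_enabled && !p->proc_ctx_bo) {
        p->proc_ctx_bo = calloc(1, 4096);
        if (!p->proc_ctx_bo)
            return -1;
    }

    pthread_mutex_lock(&p->dqm_lock);   /* no allocation under the lock */
    p->nqueues++;
    pthread_mutex_unlock(&p->dqm_lock);
    return 0;
}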

Revision tags: v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13

# 8544374c | 13-Jan-2025 | Xiaogang Chen <[email protected]>

drm/amdkfd: Have kfd driver use same PASID values from graphic driver

The current kfd driver has its own PASID value for each kfd process and uses it to locate the vm in the interrupt handler or to map between a kfd process and a vm. That design does not work when a physical gpu device has multiple spatial partitions, e.g. an adev in CPX mode. This patch has the kfd driver use the same pasid values that the graphic driver generates, which are per vm.

These pasid values are passed to fw/hardware. We do not need to change the interrupt handler even though more pasid values are used. Also, pasid values in logs are replaced by the user process pid; pasid values are not exposed to users. Users see their process pids, which have meaning in user space.

Signed-off-by: Xiaogang Chen <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

# a33f7f96 | 26-Jan-2025 | Zhu Lingshan <[email protected]>

amdkfd: properly free gang_ctx_bo when failed to init user queue

The destructor of a gtt bo is declared as:

void amdgpu_amdkfd_free_gtt_mem(struct amdgpu_device *adev, void **mem_obj);

which takes void ** as the second parameter.

GCC allows passing void * to the function because void * can be implicitly cast to any other pointer type, so the call passes compilation.

However, passing a void * where the function expects (and dereferences) a void ** results in errors at runtime.

Signed-off-by: Zhu Lingshan <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Fixes: fb91065851cd ("drm/amdkfd: Refactor queue wptr_bo GART mapping")
Signed-off-by: Alex Deucher <[email protected]>
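
The bug class is easy to reproduce in plain C; the following self-contained example mirrors the destructor's shape (names simplified, not the actual amdgpu functions):

#include <stdlib.h>

/* Mirrors the destructor's shape: frees *mem_obj and NULLs it out. */
static void free_and_clear(void **mem_obj)
{
    free(*mem_obj);
    *mem_obj = NULL;
}

int main(void)
{
    void *bo = malloc(64);

    /* Correct: pass the address of the pointer (a void **). */
    free_and_clear(&bo);

    /* Buggy pattern described in the commit: passing 'bo' itself also
     * compiles, because void * converts implicitly to void **, but the
     * callee would then dereference the buffer contents as a pointer:
     *     free_and_clear(bo);   // compiles, corrupts memory at runtime
     */
    return 0;
}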

# 9078a5bf | 14-Jan-2025 | Prike Liang <[email protected]>

drm/amdkfd: only flush the valid MES context

The following page fault was observed during the KFD process release. In this particular error case, the HIP test (./MemcpyPerformance -h) does not require the queue. As a result, process_context_addr was not assigned when the KFD process was released, ultimately leading to this page fault during the execution of the function kfd_process_dequeue_from_all_devices().
[345962.294891] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:0 pasid:0)
[345962.295333] amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 10
[345962.295775] amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000B33
[345962.296097] amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: CPC (0x5)
[345962.296394] amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x1
[345962.296633] amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x1
[345962.296876] amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[345962.297135] amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x1
[345962.297377] amdgpu 0000:03:00.0: amdgpu: RW: 0x0
[345962.297682] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:169 vmid:0 pasid:0)
Signed-off-by: Prike Liang <[email protected]>
Reviewed-by: Jonathan Kim <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Cc: [email protected]
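
A minimal sketch of the guard this fix implies, with stand-in names (only process_context_addr is taken from the commit text):

struct proc_model {
    unsigned long process_context_addr;   /* 0 if no queue was created */
};

static void dequeue_from_device_model(struct proc_model *p)
{
    /* A process that never created a queue has no MES context, so
     * there is nothing to flush; skipping avoids the page fault. */
    if (!p->process_context_addr)
        return;
    /* ... otherwise flush the MES process context here ... */
}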

Revision tags: v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6

# 71985559 | 21-Feb-2024 | Alex Sierra <[email protected]>

drm/amdkfd: add gc 9.5.0 support on kfd
Initial support for GC 9.5.0.
v2: squash in pqm_clean_queue_resource() fix from Lijo
Signed-off-by: Alex Sierra <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

# a592bb19 | 26-Nov-2024 | Andrew Martin <[email protected]>

drm/amdkfd: Dereference null return value

In the function pqm_uninit there is a call-assignment of "pdd = kfd_get_process_device_data()", which could return null, and this value was later dereferenced without checking.

Fixes: fb91065851cd ("drm/amdkfd: Refactor queue wptr_bo GART mapping")
Signed-off-by: Andrew Martin <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Cc: [email protected]
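
For illustration, a self-contained model of the defensive check the fix adds; all names are simplified stand-ins for the kfd code:

#include <stddef.h>

struct pdd { int dev_id; };

/* Stand-in lookup that can fail, like kfd_get_process_device_data(). */
static struct pdd *get_process_device_data(int found)
{
    static struct pdd the_pdd = { 0 };
    return found ? &the_pdd : NULL;
}

static int pqm_uninit_step(int found)
{
    struct pdd *pdd = get_process_device_data(found);

    if (!pdd)
        return -1;          /* the fix: bail out instead of dereferencing */
    pdd->dev_id = 0;        /* safe dereference */
    return 0;
}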

# bc4688ae | 23-Sep-2024 | Lang Yu <[email protected]>

drm/amdkfd: Remove an unused parameter in queue creation
struct file *f is unused in queue creation, remove it.
Signed-off-by: Lang Yu <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

# 21d1d724 | 10-Sep-2024 | Kent Russell <[email protected]>

drm/amdkfd: Move queue fs deletion after destroy check
We were removing the kernfs entry for queue info before checking if the queue could be destroyed. If it failed to get destroyed (e.g. during some GPU resets), then we would try to delete it later during pqm teardown, but the file was already removed. This led to a kernel WARN trying to remove size, gpuid and type. Move the remove to after the destroy check.
Signed-off-by: Kent Russell <[email protected]>
Reviewed-by: Jonathan Kim <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
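
A small self-contained model of the reordering, with hypothetical helper names standing in for the kernfs and destroy paths:

struct q_model { int fs_entry_live; };

/* Stand-in destroy hook; fails e.g. during some GPU resets. */
static int hw_destroy(struct q_model *q, int gpu_reset_failure)
{
    (void)q;
    return gpu_reset_failure ? -1 : 0;
}

/* The new ordering: remove the fs entry only once the destroy has
 * actually succeeded, so a failed destroy leaves the entry intact for
 * the later pqm teardown to remove exactly once. */
static int destroy_queue_model(struct q_model *q, int gpu_reset_failure)
{
    int ret = hw_destroy(q, gpu_reset_failure);

    if (ret)
        return ret;          /* entry still present; no double remove */
    q->fs_entry_live = 0;    /* kernfs removal happens here, once */
    return 0;
}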

# aa47fe8d | 06-Sep-2024 | Jesse Zhang <[email protected]>

drm/amdkfd: Fix resource leak in criu restore queue
To avoid memory leaks, release q_extra_data when exiting the restore queue.

v2: Correct the proto (Alex)

Signed-off-by: Jesse Zhang <[email protected]>
Reviewed-by: Tim Huang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
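
The usual kernel-style cleanup shape that avoids this kind of leak looks roughly like the following; q_extra_data is the only name taken from the commit, the rest is a stand-in:

#include <stdlib.h>

static int criu_restore_queue_model(int should_fail)
{
    int ret = 0;
    void *q_extra_data = malloc(128);

    if (!q_extra_data)
        return -1;

    if (should_fail) {
        ret = -1;
        goto exit;        /* every exit path funnels through the free */
    }

    /* ... use q_extra_data to restore the queue ... */

exit:
    free(q_extra_data);   /* released on success and on error alike */
    return ret;
}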

# a1fc9f58 | 02-Aug-2024 | Philip Yang <[email protected]>

drm/amdkfd: Handle queue destroy buffer access race
Add a helper function kfd_queue_unreference_buffers to decrease the queue buffer refcount, separating it from releasing the queue buffers.

Because holding dqm_lock while taking the vm lock would be circular locking, kfd_ioctl_destroy_queue should take the vm lock and unreference the queue buffers first, but not release them, so that errors can be handled if taking the vm lock fails. It then holds dqm_lock to remove the queue from the queue list, and releases the queue buffers afterwards.

The restore-process worker holds dqm_lock while restoring the queue, so it will always find the queue with valid queue buffers.

v2 (Felix):
- renamed kfd_queue_unreference_buffer(s) to kfd_queue_unref_bo_va(s)
- added two FIXME comments for follow up

Signed-off-by: Philip Yang <[email protected]>
Signed-off-by: Felix Kuehling <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
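
A schematic of the resulting lock ordering, using pthread mutexes as stand-ins for the vm lock and dqm_lock:

#include <pthread.h>

static pthread_mutex_t vm_lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t dqm_lock = PTHREAD_MUTEX_INITIALIZER;

/* Models the two-phase teardown: buffers are unreferenced under the vm
 * lock first; only afterwards is the dqm lock taken to unlink the
 * queue, so dqm_lock is never held while acquiring the vm lock. */
static void destroy_queue_two_phase(void)
{
    pthread_mutex_lock(&vm_lock);
    /* phase 1: unreference the queue buffer bo_vas
     * (kfd_queue_unref_bo_vas in the commit's naming) */
    pthread_mutex_unlock(&vm_lock);

    pthread_mutex_lock(&dqm_lock);
    /* phase 2: remove the queue from the queue list */
    pthread_mutex_unlock(&dqm_lock);

    /* phase 3: release the queue buffers, with no locks held */
}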

# e06b71b2 | 21-May-2024 | Jonathan Kim <[email protected]>

drm/amdkfd: allow users to target recommended SDMA engines
Certain GPUs have better copy performance over xGMI on specific SDMA engines depending on the source and destination GPU. Allow users to create SDMA queues on these recommended engines. Close to 2x overall performance has been observed with this optimization.
Signed-off-by: Jonathan Kim <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

# 305cd109 | 20-Jun-2024 | Philip Yang <[email protected]>

drm/amdkfd: Validate user queue update
Ensure update queue new ring buffer is mapped on GPU with correct size.
Decrease queue old ring_bo queue_refcount and increase new ring_bo queue_refcount.
Signed-off-by: Philip Yang <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Acked-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

# fb910658 | 20-Jun-2024 | Philip Yang <[email protected]>

drm/amdkfd: Refactor queue wptr_bo GART mapping
Add helper function kfd_queue_acquire_buffers to get queue wptr_bo reference from queue write_ptr if it is mapped to the KFD node with expected size.
Add wptr_bo to structure queue_properties because structure queue is allocated after queue buffers are validated, then we can remove wptr_bo parameter from pqm_create_queue.
Rename the struct queue member to wptr_bo_gart to hold the wptr_bo reference for GART mapping and unmapping. Move the MES wptr_bo_gart mapping to init_user_queue, the same location as the queue ctx_bo GART mapping.

Signed-off-by: Philip Yang <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Acked-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

# c86ad391 | 14-Jul-2024 | Philip Yang <[email protected]>

drm/amdkfd: amdkfd_free_gtt_mem clear the correct pointer
Pass a pointer reference to amdgpu_bo_unref to clear the correct pointer; otherwise amdgpu_bo_unref clears only the local variable and the original pointer is not set to NULL, which could cause a use-after-free bug.

Signed-off-by: Philip Yang <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Acked-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
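
The pointer-clearing pitfall is easy to demonstrate in plain C; this self-contained example mirrors the bug shape (bo_unref is a stand-in, not the real amdgpu_bo_unref):

#include <stdlib.h>

struct bo { int refcount; };

/* Clears *pbo so the caller's pointer cannot dangle. */
static void bo_unref(struct bo **pbo)
{
    free(*pbo);
    *pbo = NULL;
}

int main(void)
{
    struct bo *obj = calloc(1, sizeof(*obj));

    /* Buggy shape described in the commit: passing the address of a
     * local copy clears only the copy, and 'obj' would dangle:
     *     struct bo *local = obj;
     *     bo_unref(&local);     // obj still points at freed memory
     */
    bo_unref(&obj);         /* correct: obj == NULL afterwards */
    return obj != NULL;     /* exits 0: the pointer really was cleared */
}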

# d225960c | 03-Jun-2024 | Yunxiang Li <[email protected]>

drm/amdgpu: add lock in kfd_process_dequeue_from_device
We need to take the reset domain lock before talking to MES. While in this case we could take the lock inside the mes helper, we can't do so for most other mes helpers since they are used during reset. So, for consistency's sake, we add the lock here.

Signed-off-by: Yunxiang Li <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

# 1802b042 | 24-May-2024 | Yunxiang Li <[email protected]>

drm/amdgpu/kfd: remove is_hws_hang and is_resetting
is_hws_hang and is_resetting serve pretty much the same purpose, and both duplicate the work of the reset_domain lock; just check that directly instead. This also eliminates a few bugs listed below and gets rid of dqm->ops.pre_reset.

kfd_hws_hang did not need to avoid scheduling another reset. If the ongoing reset decided to skip the GPU reset we have a bad time; otherwise the extra reset will get cancelled anyway.

remove_queue_mes forgot to check the is_resetting flag, unlike the pre-MES path unmap_queue_cpsch, so it did not correctly block hw access during reset.

Signed-off-by: Yunxiang Li <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

# 5f571c61 | 16-Apr-2024 | Hawking Zhang <[email protected]>

drm/amdgpu: Add gfx v9_4_4 ip block
Add gfx v9_4_4 ip block support
Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Le Ma <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

Revision tags: v6.8-rc5, v6.8-rc4, v6.8-rc3

# e3de58f8 | 01-Feb-2024 | Rajneesh Bhardwaj <[email protected]>

drm/amdkfd: update SIMD distribution algo for GFXIP 9.4.2 onwards
In certain cooperative group dispatch scenarios the default SPI resource allocation may cause reduced per-CU workgroup occupancy. Set COMPUTE_RESOURCE_LIMITS.FORCE_SIMD_DIST=1 to mitigate soft hang scenarios.

Reviewed-by: Felix Kuehling <[email protected]>
Suggested-by: Joseph Greathouse <[email protected]>
Signed-off-by: Rajneesh Bhardwaj <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
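
Setting a single packed-register bit like this is a read-modify-write; the sketch below is illustrative only, and the mask/shift values are hypothetical, not the real GFX9 register layout:

#include <stdint.h>

#define FORCE_SIMD_DIST_SHIFT  16u          /* hypothetical bit position */
#define FORCE_SIMD_DIST_MASK   (1u << FORCE_SIMD_DIST_SHIFT)

static uint32_t set_force_simd_dist(uint32_t compute_resource_limits)
{
    /* Set only the FORCE_SIMD_DIST bit, leaving the rest of the packed
     * COMPUTE_RESOURCE_LIMITS contents untouched. */
    return compute_resource_limits | FORCE_SIMD_DIST_MASK;
}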

# 5869b32b | 01-Feb-2024 | Rajneesh Bhardwaj <[email protected]>

drm/amdkfd: update SIMD distribution algo for GFXIP 9.4.2 onwards
In certain cooperative group dispatch scenarios the default SPI resource allocation may cause reduced per-CU workgroup occupancy. Set COMPUTE_RESOURCE_LIMITS.FORCE_SIMD_DIST=1 to mitigate soft hang scenarios.

Reviewed-by: Felix Kuehling <[email protected]>
Suggested-by: Joseph Greathouse <[email protected]>
Signed-off-by: Rajneesh Bhardwaj <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

Revision tags: v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6

# 24149412 | 14-Dec-2023 | Jonathan Kim <[email protected]>

drm/amdkfd: only flush mes process context if mes support is there
Fix up the mes process context flush to prevent non-mes devices from spamming error messages or running into undefined behaviour during process termination.

Fixes: bd33bb1409b4 ("drm/amdkfd: fix mes set shader debugger process management")
Signed-off-by: Jonathan Kim <[email protected]>
Reviewed-by: Eric Huang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

Revision tags: v6.7-rc5

# bd33bb14 | 05-Dec-2023 | Jonathan Kim <[email protected]>

drm/amdkfd: fix mes set shader debugger process management
MES provides the driver a call to explicitly flush stale process memory within the MES to avoid a race condition that results in a fatal memory violation.
When SET_SHADER_DEBUGGER is called, the driver passes a memory address that represents a process context address MES uses to keep track of future per-process calls.
Normally, MES will purge its process context list when the last queue has been removed. The driver, however, can call SET_SHADER_DEBUGGER regardless of whether a queue has been added or not.
If SET_SHADER_DEBUGGER has been called with no queues as the last call prior to process termination, the passed process context address will still reside within MES.
On a new process call to SET_SHADER_DEBUGGER, the driver may end up passing an identical process context address value (based on per-process gpu memory address) to MES but is now pointing to a new allocated buffer object during KFD process creation. Since the MES is unaware of this, access of the passed address points to the stale object within MES and triggers a fatal memory violation.
The solution is for KFD to explicitly flush the process context address from MES on process termination.
Note that the flush call and the MES debugger calls use the same MES interface but are separated as KFD calls to avoid conflicting with each other.
Signed-off-by: Jonathan Kim <[email protected]>
Tested-by: Alice Wong <[email protected]>
Reviewed-by: Eric Huang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
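
A minimal model of the termination-time flush described above; all names are stand-ins for the kfd/MES interfaces:

#include <stdbool.h>

struct kfd_proc {
    unsigned long proc_ctx_gpu_addr;   /* address previously passed to MES */
    bool debugger_was_set;             /* SET_SHADER_DEBUGGER was called */
};

/* Stand-in for the MES flush interface. */
static void mes_flush_process_context(unsigned long addr) { (void)addr; }

static void kfd_process_terminate_model(struct kfd_proc *p)
{
    /* Even if no queue was ever created, a SET_SHADER_DEBUGGER call may
     * have left the context address cached inside MES; flushing it here
     * ensures a new process reusing the same gpu address cannot hit the
     * stale entry. */
    if (p->debugger_was_set)
        mes_flush_process_context(p->proc_ctx_gpu_addr);
}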

Revision tags: v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1

# 72838777 | 06-Nov-2023 | ZhenGuo Yin <[email protected]>

drm/amdkfd: Free gang_ctx_bo and wptr_bo in pqm_uninit
[Why] Memory leaks of gang_ctx_bo and wptr_bo.
[How] Free gang_ctx_bo and wptr_bo in pqm_uninit.
v2: add a common function pqm_clean_queue_resource to free queue's resources.
v3: reset pdd->pqd.num_gws when destroying GWS queue.

Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: ZhenGuo Yin <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

# d581ceab | 06-Nov-2023 | ZhenGuo Yin <[email protected]>

drm/amdkfd: Free gang_ctx_bo and wptr_bo in pqm_uninit
[Why] Memory leaks of gang_ctx_bo and wptr_bo.
[How] Free gang_ctx_bo and wptr_bo in pqm_uninit.
v2: add a common function pqm_clean_queue_resource to free queue's resources.
v3: reset pdd->pqd.num_gws when destroying GWS queue.

Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: ZhenGuo Yin <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>