|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7 |
|
| #
e02fcf73 |
| 14-Mar-2025 |
Alex Deucher <[email protected]> |
drm/amdgpu/sdma: fix engine reset handling
Move the kfd suspend/resume code into the caller. That is where the KFD is likely to detect a reset so on the KFD side there is no need to call them. Als
drm/amdgpu/sdma: fix engine reset handling
Move the kfd suspend/resume code into the caller. That is where the KFD is likely to detect a reset so on the KFD side there is no need to call them. Also add a mutex to lock the actual reset sequence.
v2: make the locking per instance
Fixes: bac38ca8c475 ("drm/amdkfd: implement per queue sdma reset for gfx 9.4+") Reviewed-by: Jesse Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v6.14-rc6 |
|
| #
0882ca4e |
| 05-Mar-2025 |
Yifan Zha <[email protected]> |
drm/amd/amdkfd: Evict all queues even HWS remove queue failed
[Why] If reset is detected and kfd need to evict working queues, HWS moving queue will be failed. Then remaining queues are not evicted
drm/amd/amdkfd: Evict all queues even HWS remove queue failed
[Why] If reset is detected and kfd need to evict working queues, HWS moving queue will be failed. Then remaining queues are not evicted and in active state.
After reset done, kfd uses HWS to termination remaining activated queues but HWS is resetted. So remove queue will be failed again.
[How] Keep removing all queues even if HWS returns failed. It will not affect cpsch as it checks reset_domain->sem.
v2: If any queue failed, evict queue returns error. v3: Declare err inside the if-block.
Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Yifan Zha <[email protected]> Signed-off-by: Alex Deucher <[email protected]> (cherry picked from commit 42c854b8fb0cce512534aa2b7141948e80c6ebb0) Cc: [email protected]
show more ...
|
| #
42c854b8 |
| 05-Mar-2025 |
Yifan Zha <[email protected]> |
drm/amd/amdkfd: Evict all queues even HWS remove queue failed
[Why] If reset is detected and kfd need to evict working queues, HWS moving queue will be failed. Then remaining queues are not evicted
drm/amd/amdkfd: Evict all queues even HWS remove queue failed
[Why] If reset is detected and kfd need to evict working queues, HWS moving queue will be failed. Then remaining queues are not evicted and in active state.
After reset done, kfd uses HWS to termination remaining activated queues but HWS is resetted. So remove queue will be failed again.
[How] Keep removing all queues even if HWS returns failed. It will not affect cpsch as it checks reset_domain->sem.
v2: If any queue failed, evict queue returns error. v3: Declare err inside the if-block.
Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Yifan Zha <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v6.14-rc5, v6.14-rc4, v6.14-rc3 |
|
| #
ed962f8d |
| 11-Feb-2025 |
Harish Kasiviswanathan <[email protected]> |
drm/amdkfd: Add pm_config_dequeue_wait_counts API
Update pm_update_grace_period() to more cleaner pm_config_dequeue_wait_counts(). Previously, grace_period variable was overloaded as a variable and
drm/amdkfd: Add pm_config_dequeue_wait_counts API
Update pm_update_grace_period() to more cleaner pm_config_dequeue_wait_counts(). Previously, grace_period variable was overloaded as a variable and a macro, making it inflexible to configure additional dequeue wait times.
pm_config_dequeue_wait_counts() now takes in a cmd / variable. This allows flexibility to update different dequeue wait times.
Signed-off-by: Harish Kasiviswanathan <[email protected]> Reviewed-by: Jonathan Kim <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v6.14-rc2, v6.14-rc1, v6.13 |
|
| #
cf6d949a |
| 14-Jan-2025 |
Harish Kasiviswanathan <[email protected]> |
drm/amdkfd: Add support for more per-process flag
Add support for more per-process flags starting with option to configure MFMA precision for gfx 9.5
v2: Change flag name to KFD_PROC_FLAG_MFMA_HIGH
drm/amdkfd: Add support for more per-process flag
Add support for more per-process flags starting with option to configure MFMA precision for gfx 9.5
v2: Change flag name to KFD_PROC_FLAG_MFMA_HIGH_PRECISION Remove unused else condition v3: Bump the KFD API version v4: Missed SH_MEM_CONFIG__PRECISION_MODE__SHIFT define. Added it.
Signed-off-by: Harish Kasiviswanathan <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Reviewed-by: Amber Lin <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
289e6850 |
| 14-Jan-2025 |
Harish Kasiviswanathan <[email protected]> |
drm/amdkfd: Set per-process flags only once cik/vi
Set per-process static sh_mem config only once during process initialization. Move all static changes from update_qpd() which is called each time a
drm/amdkfd: Set per-process flags only once cik/vi
Set per-process static sh_mem config only once during process initialization. Move all static changes from update_qpd() which is called each time a queue is created to set_cache_memory_policy() which is called once during process initialization.
set_cache_memory_policy() is currently defined only for cik and vi family. So this commit only focuses on these two. A separate commit will address other asics.
Signed-off-by: Harish Kasiviswanathan <[email protected]> Reviewed-by: Amber Lin <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
bac38ca8 |
| 15-Jan-2025 |
Jonathan Kim <[email protected]> |
drm/amdkfd: implement per queue sdma reset for gfx 9.4+
To reset hung SDMA queues on GFX 9.4+ for the GFX9 family, a soft reset must be issued through SMU. Since soft resets will reset an entire SD
drm/amdkfd: implement per queue sdma reset for gfx 9.4+
To reset hung SDMA queues on GFX 9.4+ for the GFX9 family, a soft reset must be issued through SMU. Since soft resets will reset an entire SDMA engine, use a common KGD call to do the reset as the KGD will handle avoiding a reset of in flight GFX and paging queues on that engine.
In addition, create a common call for all reset types to simplify the handling of module parameter settings that block gpu resets.
Signed-off-by: Jonathan Kim <[email protected]> Reviewed-by: Harish Kasiviswanathan <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
ee3ed100 |
| 25-Feb-2025 |
Philip Yang <[email protected]> |
drm/amdkfd: Remove kfd_process_hw_exception worker
With GPU reset-domain worker implemented, KFD hw_exception worker is not needed any more, just call amdgpu_amdkfd_gpu_reset directly from kfd_hws_h
drm/amdkfd: Remove kfd_process_hw_exception worker
With GPU reset-domain worker implemented, KFD hw_exception worker is not needed any more, just call amdgpu_amdkfd_gpu_reset directly from kfd_hws_hang.
Suggested-by: Felix Kuehling <[email protected]> Signed-off-by: Philip Yang <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
fddc4502 |
| 24-Feb-2025 |
Srinivasan Shanmugam <[email protected]> |
drm/amdkfd: Fix Circular Locking Dependency in 'svm_range_cpu_invalidate_pagetables'
This commit addresses a circular locking dependency in the svm_range_cpu_invalidate_pagetables function. The func
drm/amdkfd: Fix Circular Locking Dependency in 'svm_range_cpu_invalidate_pagetables'
This commit addresses a circular locking dependency in the svm_range_cpu_invalidate_pagetables function. The function previously held a lock while determining whether to perform an unmap or eviction operation, which could lead to deadlocks.
Fixes the below:
[ 223.418794] ====================================================== [ 223.418820] WARNING: possible circular locking dependency detected [ 223.418845] 6.12.0-amdstaging-drm-next-lol-050225 #14 Tainted: G U OE [ 223.418869] ------------------------------------------------------ [ 223.418889] kfdtest/3939 is trying to acquire lock: [ 223.418906] ffff8957552eae38 (&dqm->lock_hidden){+.+.}-{3:3}, at: evict_process_queues_cpsch+0x43/0x210 [amdgpu] [ 223.419302] but task is already holding lock: [ 223.419303] ffff8957556b83b0 (&prange->lock){+.+.}-{3:3}, at: svm_range_cpu_invalidate_pagetables+0x9d/0x850 [amdgpu] [ 223.419447] Console: switching to colour dummy device 80x25 [ 223.419477] [IGT] amd_basic: executing [ 223.419599] which lock already depends on the new lock.
[ 223.419611] the existing dependency chain (in reverse order) is: [ 223.419621] -> #2 (&prange->lock){+.+.}-{3:3}: [ 223.419636] __mutex_lock+0x85/0xe20 [ 223.419647] mutex_lock_nested+0x1b/0x30 [ 223.419656] svm_range_validate_and_map+0x2f1/0x15b0 [amdgpu] [ 223.419954] svm_range_set_attr+0xe8c/0x1710 [amdgpu] [ 223.420236] svm_ioctl+0x46/0x50 [amdgpu] [ 223.420503] kfd_ioctl_svm+0x50/0x90 [amdgpu] [ 223.420763] kfd_ioctl+0x409/0x6d0 [amdgpu] [ 223.421024] __x64_sys_ioctl+0x95/0xd0 [ 223.421036] x64_sys_call+0x1205/0x20d0 [ 223.421047] do_syscall_64+0x87/0x140 [ 223.421056] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 223.421068] -> #1 (reservation_ww_class_mutex){+.+.}-{3:3}: [ 223.421084] __ww_mutex_lock.constprop.0+0xab/0x1560 [ 223.421095] ww_mutex_lock+0x2b/0x90 [ 223.421103] amdgpu_amdkfd_alloc_gtt_mem+0xcc/0x2b0 [amdgpu] [ 223.421361] add_queue_mes+0x3bc/0x440 [amdgpu] [ 223.421623] unhalt_cpsch+0x1ae/0x240 [amdgpu] [ 223.421888] kgd2kfd_start_sched+0x5e/0xd0 [amdgpu] [ 223.422148] amdgpu_amdkfd_start_sched+0x3d/0x50 [amdgpu] [ 223.422414] amdgpu_gfx_enforce_isolation_handler+0x132/0x270 [amdgpu] [ 223.422662] process_one_work+0x21e/0x680 [ 223.422673] worker_thread+0x190/0x330 [ 223.422682] kthread+0xe7/0x120 [ 223.422690] ret_from_fork+0x3c/0x60 [ 223.422699] ret_from_fork_asm+0x1a/0x30 [ 223.422708] -> #0 (&dqm->lock_hidden){+.+.}-{3:3}: [ 223.422723] __lock_acquire+0x16f4/0x2810 [ 223.422734] lock_acquire+0xd1/0x300 [ 223.422742] __mutex_lock+0x85/0xe20 [ 223.422751] mutex_lock_nested+0x1b/0x30 [ 223.422760] evict_process_queues_cpsch+0x43/0x210 [amdgpu] [ 223.423025] kfd_process_evict_queues+0x8a/0x1d0 [amdgpu] [ 223.423285] kgd2kfd_quiesce_mm+0x43/0x90 [amdgpu] [ 223.423540] svm_range_cpu_invalidate_pagetables+0x4a7/0x850 [amdgpu] [ 223.423807] __mmu_notifier_invalidate_range_start+0x1f5/0x250 [ 223.423819] copy_page_range+0x1e94/0x1ea0 [ 223.423829] copy_process+0x172f/0x2ad0 [ 223.423839] kernel_clone+0x9c/0x3f0 [ 223.423847] __do_sys_clone+0x66/0x90 [ 223.423856] __x64_sys_clone+0x25/0x30 [ 223.423864] x64_sys_call+0x1d7c/0x20d0 [ 223.423872] do_syscall_64+0x87/0x140 [ 223.423880] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 223.423891] other info that might help us debug this:
[ 223.423903] Chain exists of: &dqm->lock_hidden --> reservation_ww_class_mutex --> &prange->lock
[ 223.423926] Possible unsafe locking scenario:
[ 223.423935] CPU0 CPU1 [ 223.423942] ---- ---- [ 223.423949] lock(&prange->lock); [ 223.423958] lock(reservation_ww_class_mutex); [ 223.423970] lock(&prange->lock); [ 223.423981] lock(&dqm->lock_hidden); [ 223.423990] *** DEADLOCK ***
[ 223.423999] 5 locks held by kfdtest/3939: [ 223.424006] #0: ffffffffb82b4fc0 (dup_mmap_sem){.+.+}-{0:0}, at: copy_process+0x1387/0x2ad0 [ 223.424026] #1: ffff89575eda81b0 (&mm->mmap_lock){++++}-{3:3}, at: copy_process+0x13a8/0x2ad0 [ 223.424046] #2: ffff89575edaf3b0 (&mm->mmap_lock/1){+.+.}-{3:3}, at: copy_process+0x13e4/0x2ad0 [ 223.424066] #3: ffffffffb82e76e0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: copy_page_range+0x1cea/0x1ea0 [ 223.424088] #4: ffff8957556b83b0 (&prange->lock){+.+.}-{3:3}, at: svm_range_cpu_invalidate_pagetables+0x9d/0x850 [amdgpu] [ 223.424365] stack backtrace: [ 223.424374] CPU: 0 UID: 0 PID: 3939 Comm: kfdtest Tainted: G U OE 6.12.0-amdstaging-drm-next-lol-050225 #14 [ 223.424392] Tainted: [U]=USER, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE [ 223.424401] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS PRO WIFI/X570 AORUS PRO WIFI, BIOS F36a 02/16/2022 [ 223.424416] Call Trace: [ 223.424423] <TASK> [ 223.424430] dump_stack_lvl+0x9b/0xf0 [ 223.424441] dump_stack+0x10/0x20 [ 223.424449] print_circular_bug+0x275/0x350 [ 223.424460] check_noncircular+0x157/0x170 [ 223.424469] ? __bfs+0xfd/0x2c0 [ 223.424481] __lock_acquire+0x16f4/0x2810 [ 223.424490] ? srso_return_thunk+0x5/0x5f [ 223.424505] lock_acquire+0xd1/0x300 [ 223.424514] ? evict_process_queues_cpsch+0x43/0x210 [amdgpu] [ 223.424783] __mutex_lock+0x85/0xe20 [ 223.424792] ? evict_process_queues_cpsch+0x43/0x210 [amdgpu] [ 223.425058] ? srso_return_thunk+0x5/0x5f [ 223.425067] ? mark_held_locks+0x54/0x90 [ 223.425076] ? evict_process_queues_cpsch+0x43/0x210 [amdgpu] [ 223.425339] ? srso_return_thunk+0x5/0x5f [ 223.425350] mutex_lock_nested+0x1b/0x30 [ 223.425358] ? mutex_lock_nested+0x1b/0x30 [ 223.425367] evict_process_queues_cpsch+0x43/0x210 [amdgpu] [ 223.425631] kfd_process_evict_queues+0x8a/0x1d0 [amdgpu] [ 223.425893] kgd2kfd_quiesce_mm+0x43/0x90 [amdgpu] [ 223.426156] svm_range_cpu_invalidate_pagetables+0x4a7/0x850 [amdgpu] [ 223.426423] ? srso_return_thunk+0x5/0x5f [ 223.426436] __mmu_notifier_invalidate_range_start+0x1f5/0x250 [ 223.426450] copy_page_range+0x1e94/0x1ea0 [ 223.426461] ? srso_return_thunk+0x5/0x5f [ 223.426474] ? srso_return_thunk+0x5/0x5f [ 223.426484] ? lock_acquire+0xd1/0x300 [ 223.426494] ? copy_process+0x1718/0x2ad0 [ 223.426502] ? srso_return_thunk+0x5/0x5f [ 223.426510] ? sched_clock_noinstr+0x9/0x10 [ 223.426519] ? local_clock_noinstr+0xe/0xc0 [ 223.426528] ? copy_process+0x1718/0x2ad0 [ 223.426537] ? srso_return_thunk+0x5/0x5f [ 223.426550] copy_process+0x172f/0x2ad0 [ 223.426569] kernel_clone+0x9c/0x3f0 [ 223.426577] ? __schedule+0x4c9/0x1b00 [ 223.426586] ? srso_return_thunk+0x5/0x5f [ 223.426594] ? sched_clock_noinstr+0x9/0x10 [ 223.426602] ? srso_return_thunk+0x5/0x5f [ 223.426610] ? local_clock_noinstr+0xe/0xc0 [ 223.426619] ? schedule+0x107/0x1a0 [ 223.426629] __do_sys_clone+0x66/0x90 [ 223.426643] __x64_sys_clone+0x25/0x30 [ 223.426652] x64_sys_call+0x1d7c/0x20d0 [ 223.426661] do_syscall_64+0x87/0x140 [ 223.426671] ? srso_return_thunk+0x5/0x5f [ 223.426679] ? common_nsleep+0x44/0x50 [ 223.426690] ? srso_return_thunk+0x5/0x5f [ 223.426698] ? trace_hardirqs_off+0x52/0xd0 [ 223.426709] ? srso_return_thunk+0x5/0x5f [ 223.426717] ? syscall_exit_to_user_mode+0xcc/0x200 [ 223.426727] ? srso_return_thunk+0x5/0x5f [ 223.426736] ? do_syscall_64+0x93/0x140 [ 223.426748] ? srso_return_thunk+0x5/0x5f [ 223.426756] ? up_write+0x1c/0x1e0 [ 223.426765] ? srso_return_thunk+0x5/0x5f [ 223.426775] ? srso_return_thunk+0x5/0x5f [ 223.426783] ? trace_hardirqs_off+0x52/0xd0 [ 223.426792] ? srso_return_thunk+0x5/0x5f [ 223.426800] ? syscall_exit_to_user_mode+0xcc/0x200 [ 223.426810] ? srso_return_thunk+0x5/0x5f [ 223.426818] ? do_syscall_64+0x93/0x140 [ 223.426826] ? syscall_exit_to_user_mode+0xcc/0x200 [ 223.426836] ? srso_return_thunk+0x5/0x5f [ 223.426844] ? do_syscall_64+0x93/0x140 [ 223.426853] ? srso_return_thunk+0x5/0x5f [ 223.426861] ? irqentry_exit+0x6b/0x90 [ 223.426869] ? srso_return_thunk+0x5/0x5f [ 223.426877] ? exc_page_fault+0xa7/0x2c0 [ 223.426888] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 223.426898] RIP: 0033:0x7f46758eab57 [ 223.426906] Code: ba 04 00 f3 0f 1e fa 64 48 8b 04 25 10 00 00 00 45 31 c0 31 d2 31 f6 bf 11 00 20 01 4c 8d 90 d0 02 00 00 b8 38 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 41 89 c0 85 c0 75 2c 64 48 8b 04 25 10 00 [ 223.426930] RSP: 002b:00007fff5c3e5188 EFLAGS: 00000246 ORIG_RAX: 0000000000000038 [ 223.426943] RAX: ffffffffffffffda RBX: 00007f4675f8c040 RCX: 00007f46758eab57 [ 223.426954] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011 [ 223.426965] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 [ 223.426975] R10: 00007f4675e81a50 R11: 0000000000000246 R12: 0000000000000001 [ 223.426986] R13: 00007fff5c3e5470 R14: 00007fff5c3e53e0 R15: 00007fff5c3e5410 [ 223.427004] </TASK>
v2: To resolve this issue, the allocation of the process context buffer (`proc_ctx_bo`) has been moved from the `add_queue_mes` function to the `pqm_create_queue` function. This change ensures that the buffer is allocated only when the first queue for a process is created and only if the Micro Engine Scheduler (MES) is enabled. (Felix)
v3: Fix typo s/Memory Execution Scheduler (MES)/Micro Engine Scheduler in commit message. (Lijo)
Fixes: 438b39ac74e2 ("drm/amdkfd: pause autosuspend when creating pdd") Cc: Jesse Zhang <[email protected]> Cc: Yunxiang Li <[email protected]> Cc: Philip Yang <[email protected]> Cc: Alex Sierra <[email protected]> Cc: Felix Kuehling <[email protected]> Cc: Christian König <[email protected]> Cc: Alex Deucher <[email protected]> Signed-off-by: Srinivasan Shanmugam <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
8544374c |
| 13-Jan-2025 |
Xiaogang Chen <[email protected]> |
drm/amdkfd: Have kfd driver use same PASID values from graphic driver
Current kfd driver has its own PASID value for a kfd process and uses it to locate vm at interrupt handler or mapping between kf
drm/amdkfd: Have kfd driver use same PASID values from graphic driver
Current kfd driver has its own PASID value for a kfd process and uses it to locate vm at interrupt handler or mapping between kfd process and vm. That design is not working when a physical gpu device has multiple spatial partitions, ex: adev in CPX mode. This patch has kfd driver use same pasid values that graphic driver generated which is per vm per pasid.
These pasid values are passed to fw/hardware. We do not need change interrupt handler though more pasid values are used. Also, pasid values at log are replaced by user process pid; pasid values are not exposed to user. Users see their process pids that have meaning in user space.
Signed-off-by: Xiaogang Chen <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
f214b7be |
| 16-Jan-2025 |
Jay Cornwall <[email protected]> |
drm/amdkfd: Block per-queue reset when halt_if_hws_hang=1
The purpose of halt_if_hws_hang is to preserve GPU state for driver debugging when queue preemption fails. Issuing per-queue reset may kill
drm/amdkfd: Block per-queue reset when halt_if_hws_hang=1
The purpose of halt_if_hws_hang is to preserve GPU state for driver debugging when queue preemption fails. Issuing per-queue reset may kill wavefronts which caused the preemption failure.
Signed-off-by: Jay Cornwall <[email protected]> Reviewed-by: Jonathan Kim <[email protected]> Signed-off-by: Alex Deucher <[email protected]> Cc: [email protected] # 6.12.x
show more ...
|
|
Revision tags: v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3 |
|
| #
88a45aa6 |
| 10-Dec-2024 |
Andrew Martin <[email protected]> |
drm/amdkfd: Failed to check various return code
This patch checks and warns if pdd is NULL.
Signed-off-by: Andrew Martin <[email protected]> Reviewed-by: Harish Kasiviswanathan <Harish.Kasivisw
drm/amdkfd: Failed to check various return code
This patch checks and warns if pdd is NULL.
Signed-off-by: Andrew Martin <[email protected]> Reviewed-by: Harish Kasiviswanathan <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v6.13-rc2 |
|
| #
438b39ac |
| 05-Dec-2024 |
[email protected] <[email protected]> |
drm/amdkfd: pause autosuspend when creating pdd
When using MES creating a pdd will require talking to the GPU to setup the relevant context. The code here forgot to wake up the GPU in case it was in
drm/amdkfd: pause autosuspend when creating pdd
When using MES creating a pdd will require talking to the GPU to setup the relevant context. The code here forgot to wake up the GPU in case it was in suspend, this causes KVM to EFAULT for passthrough GPU for example. This issue can be masked if the GPU was woken up by other things (e.g. opening the KMS node) first and have not yet gone to sleep.
v4: do the allocation of proc_ctx_bo in a lazy fashion when the first queue is created in a process (Felix)
Signed-off-by: Jesse Zhang <[email protected]> Reviewed-by: Yunxiang Li <[email protected]> Signed-off-by: Alex Deucher <[email protected]> Cc: [email protected]
show more ...
|
|
Revision tags: v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2 |
|
| #
af5661c7 |
| 04-Oct-2024 |
Shaoyun Liu <[email protected]> |
drm/amd/amdkfd: add/remove kfd queues on start/stop KFD scheduling
Add back kfd queues in start scheduling that originally been removed on stop scheduling.
Signed-off-by: Shaoyun Liu <shaoyun.liu@a
drm/amd/amdkfd: add/remove kfd queues on start/stop KFD scheduling
Add back kfd queues in start scheduling that originally been removed on stop scheduling.
Signed-off-by: Shaoyun Liu <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Reviewed-by: Philip Yang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
b6890efb |
| 22-Oct-2024 |
Victor Zhao <[email protected]> |
drm/amdkfd: remove extra use of volatile
as the adding of mb() should be sufficient in function unmap_queues_cpsch, remove the add of volatile type as recommended
Signed-off-by: Victor Zhao <Victor
drm/amdkfd: remove extra use of volatile
as the adding of mb() should be sufficient in function unmap_queues_cpsch, remove the add of volatile type as recommended
Signed-off-by: Victor Zhao <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
88344561 |
| 17-Oct-2024 |
Victor Zhao <[email protected]> |
drm/amdkfd: fix the hang caused by the write reorder to fence_addr
make sure KFD_FENCE_INIT write to fence_addr before pm_send_query_status called, to avoid qcm fence timeout caused by incorrect ord
drm/amdkfd: fix the hang caused by the write reorder to fence_addr
make sure KFD_FENCE_INIT write to fence_addr before pm_send_query_status called, to avoid qcm fence timeout caused by incorrect ordering.
Signed-off-by: Victor Zhao <[email protected]> Reviewed-by: Philip Yang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
dafc87dc |
| 03-Oct-2024 |
Philip Yang <[email protected]> |
drm/amdkfd: Copy wave state only for compute queue
get_wave_state is not defined for sdma queue, copy_context_work_handler calls it for sdma queue will crash.
Signed-off-by: Philip Yang <Philip.Yan
drm/amdkfd: Copy wave state only for compute queue
get_wave_state is not defined for sdma queue, copy_context_work_handler calls it for sdma queue will crash.
Signed-off-by: Philip Yang <[email protected]> Reviewed-by: Jonathan Kim <[email protected]> Tested-by: Jonathan Kim <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v6.12-rc1 |
|
| #
e45b011d |
| 20-Sep-2024 |
Mukul Joshi <[email protected]> |
drm/amdkfd: Fix CU occupancy for GFX 9.4.3
Make CU occupancy calculations work on GFX 9.4.3 by updating the logic to handle multiple XCCs correctly.
Signed-off-by: Mukul Joshi <[email protected]>
drm/amdkfd: Fix CU occupancy for GFX 9.4.3
Make CU occupancy calculations work on GFX 9.4.3 by updating the logic to handle multiple XCCs correctly.
Signed-off-by: Mukul Joshi <[email protected]> Reviewed-by: Harish Kasiviswanathan <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
6ae9e1ab |
| 16-Sep-2024 |
Mukul Joshi <[email protected]> |
drm/amdkfd: Update logic for CU occupancy calculations
Currently, the code uses the IH_VMID_X_LUT register to map a queue's vmid to the corresponding PASID. This logic is racy since CP can update th
drm/amdkfd: Update logic for CU occupancy calculations
Currently, the code uses the IH_VMID_X_LUT register to map a queue's vmid to the corresponding PASID. This logic is racy since CP can update the VMID-PASID mapping anytime especially when there are more processes than number of vmids. Update the logic to calculate CU occupancy by matching doorbell offset of the queue with valid wave counts against the process's queues.
Signed-off-by: Mukul Joshi <[email protected]> Reviewed-by: Harish Kasiviswanathan <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5 |
|
| #
101025e9 |
| 22-Aug-2024 |
Jonathan Kim <[email protected]> |
drm/amdkfd: fix missed queue reset on queue destroy
If a queue is being destroyed but causes a HWS hang on removal, the KFD may issue an unnecessary gpu reset if the destroyed queue can be fixed by
drm/amdkfd: fix missed queue reset on queue destroy
If a queue is being destroyed but causes a HWS hang on removal, the KFD may issue an unnecessary gpu reset if the destroyed queue can be fixed by a queue reset.
This is because the queue has been removed from the KFD's queue list prior to the preemption action on destroy so the reset call will fail to match the HQD PQ reset information against the KFD's queue record to do the actual reset.
To fix this, deactivate the queue prior to preemption since it's being destroyed anyways and remove the queue from the KFD's queue list after preemption.
Signed-off-by: Jonathan Kim <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v6.11-rc4 |
|
| #
eb067d65 |
| 12-Aug-2024 |
Mukul Joshi <[email protected]> |
drm/amdkfd: Update BadOpcode Interrupt handling with MES
Based on the recommendation of MEC FW, update BadOpcode interrupt handling by unmapping all queues, removing the queue that got the interrupt
drm/amdkfd: Update BadOpcode Interrupt handling with MES
Based on the recommendation of MEC FW, update BadOpcode interrupt handling by unmapping all queues, removing the queue that got the interrupt from scheduling and remapping rest of the queues back when using MES scheduler. This is done to prevent the case where unmapping of the bad queue can fail thereby causing a GPU reset.
Signed-off-by: Mukul Joshi <[email protected]> Acked-by: Harish Kasiviswanathan <[email protected]> Acked-by: Alex Deucher <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3 |
|
| #
9a16042f |
| 03-Jun-2024 |
Mukul Joshi <[email protected]> |
drm/amdkfd: Update queue unmap after VM fault with MES
MEC FW expects MES to unmap all queues when a VM fault is observed on a queue and then resumed once the affected process is terminated. Use the
drm/amdkfd: Update queue unmap after VM fault with MES
MEC FW expects MES to unmap all queues when a VM fault is observed on a queue and then resumed once the affected process is terminated. Use the MES Suspend and Resume APIs to achieve this.
Signed-off-by: Mukul Joshi <[email protected]> Acked-by: Alex Deucher <[email protected]> Reviewed-by: Harish Kasiviswanathan <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
234eebe1 |
| 29-Jul-2024 |
Amber Lin <[email protected]> |
drm/amdkfd: APIs to stop/start KFD scheduling
Provide amdgpu_amdkfd_stop_sched() for amdgpu to stop KFD scheduling compute work on HIQ. amdgpu_amdkfd_start_sched() resumes the scheduling. When amdgp
drm/amdkfd: APIs to stop/start KFD scheduling
Provide amdgpu_amdkfd_stop_sched() for amdgpu to stop KFD scheduling compute work on HIQ. amdgpu_amdkfd_start_sched() resumes the scheduling. When amdgpu_amdkfd_stop_sched is called, KFD will unmap queues from runlist. If users send ioctls to KFD to create queues, they'll be added but those queues won't be mapped to runlist (so not scheduled) until amdgpu_amdkfd_start_sched is called.
v2: fix build (Alex)
Signed-off-by: Amber Lin <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
ee0a469c |
| 25-Jun-2024 |
Jonathan Kim <[email protected]> |
drm/amdkfd: support per-queue reset on gfx9
Support per-queue reset for GFX9. The recommendation is for the driver to target reset the HW queue via a SPI MMIO register write.
Since this requires p
drm/amdkfd: support per-queue reset on gfx9
Support per-queue reset for GFX9. The recommendation is for the driver to target reset the HW queue via a SPI MMIO register write.
Since this requires pipe and HW queue info and MEC FW is limited to doorbell reports of hung queues after an unmap failure, scan the HW queue slots defined by SET_RESOURCES first to identify the user queue candidates to reset.
Only signal reset events to processes that have had a queue reset.
If queue reset fails, fall back to GPU reset.
Signed-off-by: Jonathan Kim <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v6.10-rc2, v6.10-rc1 |
|
| #
e06b71b2 |
| 21-May-2024 |
Jonathan Kim <[email protected]> |
drm/amdkfd: allow users to target recommended SDMA engines
Certain GPUs have better copy performance over xGMI on specific SDMA engines depending on the source and destination GPU. Allow users to cr
drm/amdkfd: allow users to target recommended SDMA engines
Certain GPUs have better copy performance over xGMI on specific SDMA engines depending on the source and destination GPU. Allow users to create SDMA queues on these recommended engines. Close to 2x overall performance has been observed with this optimization.
Signed-off-by: Jonathan Kim <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|