|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2 |
|
| #
f0b4440c |
| 06-Feb-2025 |
Philip Yang <[email protected]> |
drm/amdkfd: Fix mode1 reset crash issue
If HW scheduler hangs and mode1 reset is used to recover GPU, KFD signal user space to abort the processes. After process abort exit, user queues still use th
drm/amdkfd: Fix mode1 reset crash issue
If HW scheduler hangs and mode1 reset is used to recover GPU, KFD signal user space to abort the processes. After process abort exit, user queues still use the GPU to access system memory before h/w is reset while KFD cleanup worker free system memory and free VRAM.
There is use-after-free race bug that KFD allocate and reuse the freed system memory, and user queue write to the same system memory to corrupt the data structure and cause driver crash.
To fix this race, KFD cleanup worker terminate user queues, then flush reset_domain wq to wait for any GPU ongoing reset complete, and then free outstanding BOs.
Signed-off-by: Philip Yang <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
1b9366c6 |
| 18-Feb-2025 |
Philip Yang <[email protected]> |
drm/amdkfd: KFD release_work possible circular locking
If waiting for gpu reset done in KFD release_work, thers is WARNING: possible circular locking dependency detected
#2 kfd_create_process
drm/amdkfd: KFD release_work possible circular locking
If waiting for gpu reset done in KFD release_work, thers is WARNING: possible circular locking dependency detected
#2 kfd_create_process kfd_process_mutex flush kfd release work
#1 kfd release work wait for amdgpu reset work
#0 amdgpu_device_gpu_reset kgd2kfd_pre_reset kfd_process_mutex
Possible unsafe locking scenario:
CPU0 CPU1 ---- ---- lock((work_completion)(&p->release_work)); lock((wq_completion)kfd_process_wq); lock((work_completion)(&p->release_work)); lock((wq_completion)amdgpu-reset-dev);
To fix this, KFD create process move flush release work outside kfd_process_mutex.
Signed-off-by: Philip Yang <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
2b04d04d |
| 17-Feb-2025 |
Srinivasan Shanmugam <[email protected]> |
drm/amdkfd: Fix error handling for missing PASID in 'kfd_process_device_init_vm'
In the kfd_process_device_init_vm function, a valid error code is now returned when the associated Process Address Sp
drm/amdkfd: Fix error handling for missing PASID in 'kfd_process_device_init_vm'
In the kfd_process_device_init_vm function, a valid error code is now returned when the associated Process Address Space ID (PASID) is not present.
If the address space virtual memory (avm) does not have an associated PASID, the function sets the ret variable to -EINVAL before proceeding to the error handling section. This ensures that the calling function, such as kfd_ioctl_acquire_vm, can appropriately handle the error, thereby preventing any issues during virtual memory initialization.
Fixes the below: drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_process.c:1694 kfd_process_device_init_vm() warn: missing error code 'ret'
drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_process.c 1647 int kfd_process_device_init_vm(struct kfd_process_device *pdd, 1648 struct file *drm_file) 1649 { ... 1690 1691 if (unlikely(!avm->pasid)) { 1692 dev_warn(pdd->dev->adev->dev, "WARN: vm %p has no pasid associated", 1693 avm); --> 1694 goto err_get_pasid;
ret = -EINVAL?
1695 }
Fixes: 8544374c0f82 ("drm/amdkfd: Have kfd driver use same PASID values from graphic driver") Reported by: Dan Carpenter <[email protected]> Cc: Xiaogang Chen <[email protected]> Cc: Felix Kuehling <[email protected]> Cc: Christian König <[email protected]> Cc: Alex Deucher <[email protected]> Signed-off-by: Srinivasan Shanmugam <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
10e08943 |
| 12-Feb-2025 |
Xiaogang Chen <[email protected]> |
drm/amdkfd: Fix pasid value leak
Curret kfd does not allocate pasid values, instead uses pasid value for each vm from graphic driver. So should not prevent graphic driver from releasing pasid values
drm/amdkfd: Fix pasid value leak
Curret kfd does not allocate pasid values, instead uses pasid value for each vm from graphic driver. So should not prevent graphic driver from releasing pasid values since the values are allocated by graphic driver, not kfd driver anymore. This patch does not stop graphic driver release pasid values.
Fixes: 8544374c0f82 ("drm/amdkfd: Have kfd driver use same PASID values from graphic driver") Signed-off-by: Xiaogang Chen <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v6.14-rc1, v6.13 |
|
| #
8544374c |
| 13-Jan-2025 |
Xiaogang Chen <[email protected]> |
drm/amdkfd: Have kfd driver use same PASID values from graphic driver
Current kfd driver has its own PASID value for a kfd process and uses it to locate vm at interrupt handler or mapping between kf
drm/amdkfd: Have kfd driver use same PASID values from graphic driver
Current kfd driver has its own PASID value for a kfd process and uses it to locate vm at interrupt handler or mapping between kfd process and vm. That design is not working when a physical gpu device has multiple spatial partitions, ex: adev in CPX mode. This patch has kfd driver use same pasid values that graphic driver generated which is per vm per pasid.
These pasid values are passed to fw/hardware. We do not need change interrupt handler though more pasid values are used. Also, pasid values at log are replaced by user process pid; pasid values are not exposed to user. Users see their process pids that have meaning in user space.
Signed-off-by: Xiaogang Chen <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3 |
|
| #
a993d319 |
| 11-Dec-2024 |
Zhu Lingshan <[email protected]> |
drm/amdkfd: wq_release signals dma_fence only when available
kfd_process_wq_release() signals eviction fence by dma_fence_signal() which wanrs if dma_fence is NULL.
kfd_process->ef is initialized b
drm/amdkfd: wq_release signals dma_fence only when available
kfd_process_wq_release() signals eviction fence by dma_fence_signal() which wanrs if dma_fence is NULL.
kfd_process->ef is initialized by kfd_process_device_init_vm() through ioctl. That means the fence is NULL for a new created kfd_process, and close a kfd_process right after open it will trigger the warning.
This commit conditionally signals the eviction fence in kfd_process_wq_release() only when it is available.
[ 503.660882] WARNING: CPU: 0 PID: 9 at drivers/dma-buf/dma-fence.c:467 dma_fence_signal+0x74/0xa0 [ 503.782940] Workqueue: kfd_process_wq kfd_process_wq_release [amdgpu] [ 503.789640] RIP: 0010:dma_fence_signal+0x74/0xa0 [ 503.877620] Call Trace: [ 503.880066] <TASK> [ 503.882168] ? __warn+0xcd/0x260 [ 503.885407] ? dma_fence_signal+0x74/0xa0 [ 503.889416] ? report_bug+0x288/0x2d0 [ 503.893089] ? handle_bug+0x53/0xa0 [ 503.896587] ? exc_invalid_op+0x14/0x50 [ 503.900424] ? asm_exc_invalid_op+0x16/0x20 [ 503.904616] ? dma_fence_signal+0x74/0xa0 [ 503.908626] kfd_process_wq_release+0x6b/0x370 [amdgpu] [ 503.914081] process_one_work+0x654/0x10a0 [ 503.918186] worker_thread+0x6c3/0xe70 [ 503.921943] ? srso_alias_return_thunk+0x5/0xfbef5 [ 503.926735] ? srso_alias_return_thunk+0x5/0xfbef5 [ 503.931527] ? __kthread_parkme+0x82/0x140 [ 503.935631] ? __pfx_worker_thread+0x10/0x10 [ 503.939904] kthread+0x2a8/0x380 [ 503.943132] ? __pfx_kthread+0x10/0x10 [ 503.946882] ret_from_fork+0x2d/0x70 [ 503.950458] ? __pfx_kthread+0x10/0x10 [ 503.954210] ret_from_fork_asm+0x1a/0x30 [ 503.958142] </TASK> [ 503.960328] ---[ end trace 0000000000000000 ]---
Fixes: 967d226eaae8 ("dma-buf: add WARN_ON() illegal dma-fence signaling") Signed-off-by: Zhu Lingshan <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]> (cherry picked from commit 2774ef7625adb5fb9e9265c26a59dca7b8fd171e) Cc: [email protected]
show more ...
|
| #
2774ef76 |
| 11-Dec-2024 |
Zhu Lingshan <[email protected]> |
drm/amdkfd: wq_release signals dma_fence only when available
kfd_process_wq_release() signals eviction fence by dma_fence_signal() which wanrs if dma_fence is NULL.
kfd_process->ef is initialized b
drm/amdkfd: wq_release signals dma_fence only when available
kfd_process_wq_release() signals eviction fence by dma_fence_signal() which wanrs if dma_fence is NULL.
kfd_process->ef is initialized by kfd_process_device_init_vm() through ioctl. That means the fence is NULL for a new created kfd_process, and close a kfd_process right after open it will trigger the warning.
This commit conditionally signals the eviction fence in kfd_process_wq_release() only when it is available.
[ 503.660882] WARNING: CPU: 0 PID: 9 at drivers/dma-buf/dma-fence.c:467 dma_fence_signal+0x74/0xa0 [ 503.782940] Workqueue: kfd_process_wq kfd_process_wq_release [amdgpu] [ 503.789640] RIP: 0010:dma_fence_signal+0x74/0xa0 [ 503.877620] Call Trace: [ 503.880066] <TASK> [ 503.882168] ? __warn+0xcd/0x260 [ 503.885407] ? dma_fence_signal+0x74/0xa0 [ 503.889416] ? report_bug+0x288/0x2d0 [ 503.893089] ? handle_bug+0x53/0xa0 [ 503.896587] ? exc_invalid_op+0x14/0x50 [ 503.900424] ? asm_exc_invalid_op+0x16/0x20 [ 503.904616] ? dma_fence_signal+0x74/0xa0 [ 503.908626] kfd_process_wq_release+0x6b/0x370 [amdgpu] [ 503.914081] process_one_work+0x654/0x10a0 [ 503.918186] worker_thread+0x6c3/0xe70 [ 503.921943] ? srso_alias_return_thunk+0x5/0xfbef5 [ 503.926735] ? srso_alias_return_thunk+0x5/0xfbef5 [ 503.931527] ? __kthread_parkme+0x82/0x140 [ 503.935631] ? __pfx_worker_thread+0x10/0x10 [ 503.939904] kthread+0x2a8/0x380 [ 503.943132] ? __pfx_kthread+0x10/0x10 [ 503.946882] ret_from_fork+0x2d/0x70 [ 503.950458] ? __pfx_kthread+0x10/0x10 [ 503.954210] ret_from_fork_asm+0x1a/0x30 [ 503.958142] </TASK> [ 503.960328] ---[ end trace 0000000000000000 ]---
Fixes: 967d226eaae8 ("dma-buf: add WARN_ON() illegal dma-fence signaling") Signed-off-by: Zhu Lingshan <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6 |
|
| #
71985559 |
| 21-Feb-2024 |
Alex Sierra <[email protected]> |
drm/amdkfd: add gc 9.5.0 support on kfd
Initial support for GC 9.5.0.
v2: squash in pqm_clean_queue_resource() fix from Lijo
Signed-off-by: Alex Sierra <[email protected]> Reviewed-by: Felix Kue
drm/amdkfd: add gc 9.5.0 support on kfd
Initial support for GC 9.5.0.
v2: squash in pqm_clean_queue_resource() fix from Lijo
Signed-off-by: Alex Sierra <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
438b39ac |
| 05-Dec-2024 |
[email protected] <[email protected]> |
drm/amdkfd: pause autosuspend when creating pdd
When using MES creating a pdd will require talking to the GPU to setup the relevant context. The code here forgot to wake up the GPU in case it was in
drm/amdkfd: pause autosuspend when creating pdd
When using MES creating a pdd will require talking to the GPU to setup the relevant context. The code here forgot to wake up the GPU in case it was in suspend, this causes KVM to EFAULT for passthrough GPU for example. This issue can be masked if the GPU was woken up by other things (e.g. opening the KMS node) first and have not yet gone to sleep.
v4: do the allocation of proc_ctx_bo in a lazy fashion when the first queue is created in a process (Felix)
Signed-off-by: Jesse Zhang <[email protected]> Reviewed-by: Yunxiang Li <[email protected]> Signed-off-by: Alex Deucher <[email protected]> Cc: [email protected]
show more ...
|
| #
6cb6d437 |
| 28-Oct-2024 |
Xiaogang Chen <[email protected]> |
drm/amdkfd: change kfd process kref count at creation
kfd process kref count(process->ref) is initialized to 1 by kref_init. After it is created not need to increase its kref. Instad add kfd process
drm/amdkfd: change kfd process kref count at creation
kfd process kref count(process->ref) is initialized to 1 by kref_init. After it is created not need to increase its kref. Instad add kfd process kref at kfd process mmu notifier allocation since we already decrease the kref at free_notifier of mmu_notifier_ops, so pair them.
When user process opens kfd node multiple times the kfd process kref is increased each time to balance with kfd node close operation.
Signed-off-by: Xiaogang Chen <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
21cae8de |
| 06-Nov-2024 |
Yuan Can <[email protected]> |
drm/amdkfd: Fix wrong usage of INIT_WORK()
In kfd_procfs_show(), the sdma_activity_work_handler is a local variable and the sdma_activity_work_handler.sdma_activity_work should initialize with INIT_
drm/amdkfd: Fix wrong usage of INIT_WORK()
In kfd_procfs_show(), the sdma_activity_work_handler is a local variable and the sdma_activity_work_handler.sdma_activity_work should initialize with INIT_WORK_ONSTACK() instead of INIT_WORK().
Fixes: 32cb59f31362 ("drm/amdkfd: Track SDMA utilization per process") Signed-off-by: Yuan Can <[email protected]> Signed-off-by: Felix Kuehling <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
922f0e00 |
| 04-Oct-2024 |
Srinivasan Shanmugam <[email protected]> |
drm/amdkfd: Use dynamic allocation for CU occupancy array in 'kfd_get_cu_occupancy()'
The `kfd_get_cu_occupancy` function previously declared a large `cu_occupancy` array as a local variable, which
drm/amdkfd: Use dynamic allocation for CU occupancy array in 'kfd_get_cu_occupancy()'
The `kfd_get_cu_occupancy` function previously declared a large `cu_occupancy` array as a local variable, which could lead to stack overflows due to excessive stack usage. This commit replaces the static array allocation with dynamic memory allocation using `kcalloc`, thereby reducing the stack size.
This change avoids the risk of stack overflows in kernel space, in scenarios where `AMDGPU_MAX_QUEUES` is large. The allocated memory is freed using `kfree` before the function returns to prevent memory leaks.
Fixes the below with gcc W=1: drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_process.c: In function ‘kfd_get_cu_occupancy’: drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_process.c:322:1: warning: the frame size of 1056 bytes is larger than 1024 bytes [-Wframe-larger-than=] 322 | } | ^
Fixes: 6ae9e1aba97e ("drm/amdkfd: Update logic for CU occupancy calculations") Cc: Harish Kasiviswanathan <[email protected]> Cc: Felix Kuehling <[email protected]> Cc: Christian König <[email protected]> Cc: Alex Deucher <[email protected]> Signed-off-by: Srinivasan Shanmugam <[email protected]> Suggested-by: Mukul Joshi <[email protected]> Reviewed-by: Mukul Joshi <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
68d26c10 |
| 04-Oct-2024 |
Philip Yang <[email protected]> |
drm/amdkfd: Accounting pdd vram_usage for svm
Process device data pdd->vram_usage is read by rocm-smi via sysfs, this is currently missing the svm_bo usage accounting, so "rocm-smi --showpids" per p
drm/amdkfd: Accounting pdd vram_usage for svm
Process device data pdd->vram_usage is read by rocm-smi via sysfs, this is currently missing the svm_bo usage accounting, so "rocm-smi --showpids" per process VRAM usage report is incorrect.
Add pdd->vram_usage accounting when svm_bo allocation and release, change to atomic64_t type because it is updated outside process mutex now.
Signed-off-by: Philip Yang <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]> (cherry picked from commit 98c0b0efcc11f2a5ddf3ce33af1e48eedf808b04)
show more ...
|
| #
98c0b0ef |
| 04-Oct-2024 |
Philip Yang <[email protected]> |
drm/amdkfd: Accounting pdd vram_usage for svm
Process device data pdd->vram_usage is read by rocm-smi via sysfs, this is currently missing the svm_bo usage accounting, so "rocm-smi --showpids" per p
drm/amdkfd: Accounting pdd vram_usage for svm
Process device data pdd->vram_usage is read by rocm-smi via sysfs, this is currently missing the svm_bo usage accounting, so "rocm-smi --showpids" per process VRAM usage report is incorrect.
Add pdd->vram_usage accounting when svm_bo allocation and release, change to atomic64_t type because it is updated outside process mutex now.
Signed-off-by: Philip Yang <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
d7d7b947 |
| 27-Sep-2024 |
Lang Yu <[email protected]> |
drm/amdkfd: Fix an eviction fence leak
Only creating a new reference for each process instead of each VM.
Fixes: 9a1c1339abf9 ("drm/amdkfd: Run restore_workers on freezable WQs") Suggested-by: Feli
drm/amdkfd: Fix an eviction fence leak
Only creating a new reference for each process instead of each VM.
Fixes: 9a1c1339abf9 ("drm/amdkfd: Run restore_workers on freezable WQs") Suggested-by: Felix Kuehling <[email protected]> Signed-off-by: Lang Yu <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]> (cherry picked from commit 5fa436289483ae56427b0896c31f72361223c758) Cc: [email protected]
show more ...
|
| #
5fa43628 |
| 27-Sep-2024 |
Lang Yu <[email protected]> |
drm/amdkfd: Fix an eviction fence leak
Only creating a new reference for each process instead of each VM.
Fixes: 9a1c1339abf9 ("drm/amdkfd: Run restore_workers on freezable WQs") Suggested-by: Feli
drm/amdkfd: Fix an eviction fence leak
Only creating a new reference for each process instead of each VM.
Fixes: 9a1c1339abf9 ("drm/amdkfd: Run restore_workers on freezable WQs") Suggested-by: Felix Kuehling <[email protected]> Signed-off-by: Lang Yu <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
e45b011d |
| 20-Sep-2024 |
Mukul Joshi <[email protected]> |
drm/amdkfd: Fix CU occupancy for GFX 9.4.3
Make CU occupancy calculations work on GFX 9.4.3 by updating the logic to handle multiple XCCs correctly.
Signed-off-by: Mukul Joshi <[email protected]>
drm/amdkfd: Fix CU occupancy for GFX 9.4.3
Make CU occupancy calculations work on GFX 9.4.3 by updating the logic to handle multiple XCCs correctly.
Signed-off-by: Mukul Joshi <[email protected]> Reviewed-by: Harish Kasiviswanathan <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
6ae9e1ab |
| 16-Sep-2024 |
Mukul Joshi <[email protected]> |
drm/amdkfd: Update logic for CU occupancy calculations
Currently, the code uses the IH_VMID_X_LUT register to map a queue's vmid to the corresponding PASID. This logic is racy since CP can update th
drm/amdkfd: Update logic for CU occupancy calculations
Currently, the code uses the IH_VMID_X_LUT register to map a queue's vmid to the corresponding PASID. This logic is racy since CP can update the VMID-PASID mapping anytime especially when there are more processes than number of vmids. Update the logic to calculate CU occupancy by matching doorbell offset of the queue with valid wave counts against the process's queues.
Signed-off-by: Mukul Joshi <[email protected]> Reviewed-by: Harish Kasiviswanathan <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
ee0a469c |
| 25-Jun-2024 |
Jonathan Kim <[email protected]> |
drm/amdkfd: support per-queue reset on gfx9
Support per-queue reset for GFX9. The recommendation is for the driver to target reset the HW queue via a SPI MMIO register write.
Since this requires p
drm/amdkfd: support per-queue reset on gfx9
Support per-queue reset for GFX9. The recommendation is for the driver to target reset the HW queue via a SPI MMIO register write.
Since this requires pipe and HW queue info and MEC FW is limited to doorbell reports of hung queues after an unmap failure, scan the HW queue slots defined by SET_RESOURCES first to identify the user queue candidates to reset.
Only signal reset events to processes that have had a queue reset.
If queue reset fails, fall back to GPU reset.
Signed-off-by: Jonathan Kim <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
c86ad391 |
| 14-Jul-2024 |
Philip Yang <[email protected]> |
drm/amdkfd: amdkfd_free_gtt_mem clear the correct pointer
Pass pointer reference to amdgpu_bo_unref to clear the correct pointer, otherwise amdgpu_bo_unref clear the local variable, the original poi
drm/amdkfd: amdkfd_free_gtt_mem clear the correct pointer
Pass pointer reference to amdgpu_bo_unref to clear the correct pointer, otherwise amdgpu_bo_unref clear the local variable, the original pointer not set to NULL, this could cause use-after-free bug.
Signed-off-by: Philip Yang <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Acked-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
62ec7d38 |
| 24-Jun-2024 |
Lijo Lazar <[email protected]> |
drm/amdkfd: Use device based logging for errors
Convert some pr_* to some dev_* APIs to identify the device.
Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Alex Deucher <alexander.deuc
drm/amdkfd: Use device based logging for errors
Convert some pr_* to some dev_* APIs to identify the device.
Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
5f571c61 |
| 16-Apr-2024 |
Hawking Zhang <[email protected]> |
drm/amdgpu: Add gfx v9_4_4 ip block
Add gfx v9_4_4 ip block support
Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Le Ma <[email protected]> Signed-off-by: Alex Deucher <alexander.de
drm/amdgpu: Add gfx v9_4_4 ip block
Add gfx v9_4_4 ip block support
Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Le Ma <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
a89a05e3 |
| 10-Apr-2024 |
Lancelot SIX <[email protected]> |
drm/amdkfd: Flush the process wq before creating a kfd_process
There is a race condition when re-creating a kfd_process for a process. This has been observed when a process under the debugger execut
drm/amdkfd: Flush the process wq before creating a kfd_process
There is a race condition when re-creating a kfd_process for a process. This has been observed when a process under the debugger executes exec(3). In this scenario: - The process executes exec. - This will eventually release the process's mm, which will cause the kfd_process object associated with the process to be freed (kfd_process_free_notifier decrements the reference count to the kfd_process to 0). This causes kfd_process_ref_release to enqueue kfd_process_wq_release to the kfd_process_wq. - The debugger receives the PTRACE_EVENT_EXEC notification, and tries to re-enable AMDGPU traps (KFD_IOC_DBG_TRAP_ENABLE). - When handling this request, KFD tries to re-create a kfd_process. This eventually calls kfd_create_process and kobject_init_and_add.
At this point the call to kobject_init_and_add can fail because the old kfd_process.kobj has not been freed yet by kfd_process_wq_release.
This patch proposes to avoid this race by making sure to drain kfd_process_wq before creating a new kfd_process object. This way, we know that any cleanup task is done executing when we reach kobject_init_and_add.
Signed-off-by: Lancelot SIX <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
f5b90533 |
| 10-Apr-2024 |
Lancelot SIX <[email protected]> |
drm/amdkfd: Flush the process wq before creating a kfd_process
There is a race condition when re-creating a kfd_process for a process. This has been observed when a process under the debugger execut
drm/amdkfd: Flush the process wq before creating a kfd_process
There is a race condition when re-creating a kfd_process for a process. This has been observed when a process under the debugger executes exec(3). In this scenario: - The process executes exec. - This will eventually release the process's mm, which will cause the kfd_process object associated with the process to be freed (kfd_process_free_notifier decrements the reference count to the kfd_process to 0). This causes kfd_process_ref_release to enqueue kfd_process_wq_release to the kfd_process_wq. - The debugger receives the PTRACE_EVENT_EXEC notification, and tries to re-enable AMDGPU traps (KFD_IOC_DBG_TRAP_ENABLE). - When handling this request, KFD tries to re-create a kfd_process. This eventually calls kfd_create_process and kobject_init_and_add.
At this point the call to kobject_init_and_add can fail because the old kfd_process.kobj has not been freed yet by kfd_process_wq_release.
This patch proposes to avoid this race by making sure to drain kfd_process_wq before creating a new kfd_process object. This way, we know that any cleanup task is done executing when we reach kobject_init_and_add.
Signed-off-by: Lancelot SIX <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
f989eccc |
| 19-Apr-2024 |
Felix Kuehling <[email protected]> |
drm/amdkfd: Fix rescheduling of restore worker
Handle the case that the restore worker was already scheduled by another eviction while the restore was in progress.
Fixes: 9a1c1339abf9 ("drm/amdkfd:
drm/amdkfd: Fix rescheduling of restore worker
Handle the case that the restore worker was already scheduled by another eviction while the restore was in progress.
Fixes: 9a1c1339abf9 ("drm/amdkfd: Run restore_workers on freezable WQs") Signed-off-by: Felix Kuehling <[email protected]> Reviewed-by: Philip Yang <[email protected]> Tested-by: Yunxiang Li <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|