History log of /linux-6.15/drivers/gpu/drm/amd/amdkfd/kfd_process.c (Results 1 – 25 of 218)
Revision Date Author Comments
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2
# f0b4440c 06-Feb-2025 Philip Yang <[email protected]>

drm/amdkfd: Fix mode1 reset crash issue

If the HW scheduler hangs and mode1 reset is used to recover the GPU, KFD
signals user space to abort the processes. After the aborted processes exit,
user queues still use the GPU to access system memory before the h/w is
reset, while the KFD cleanup worker frees system memory and VRAM.

This is a use-after-free race: KFD allocates and reuses the freed system
memory while a user queue writes to the same memory, corrupting the data
structure and crashing the driver.

To fix this race, the KFD cleanup worker terminates user queues, then flushes
the reset_domain wq to wait for any ongoing GPU reset to complete, and then
frees outstanding BOs.

Signed-off-by: Philip Yang <[email protected]>
Reviewed-by: Lijo Lazar <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>


# 1b9366c6 18-Feb-2025 Philip Yang <[email protected]>

drm/amdkfd: KFD release_work possible circular locking

If waiting for gpu reset done in KFD release_work, there is WARNING:
possible circular locking dependency detected

#2 kfd_create_process
   kfd_process_mutex
   flush kfd release work

#1 kfd release work
   wait for amdgpu reset work

#0 amdgpu_device_gpu_reset
   kgd2kfd_pre_reset
   kfd_process_mutex

Possible unsafe locking scenario:

      CPU0                    CPU1
      ----                    ----
 lock((work_completion)(&p->release_work));
                         lock((wq_completion)kfd_process_wq);
                         lock((work_completion)(&p->release_work));
 lock((wq_completion)amdgpu-reset-dev);

To fix this, kfd_create_process flushes the release work outside
kfd_process_mutex.

Signed-off-by: Philip Yang <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>


# 2b04d04d 17-Feb-2025 Srinivasan Shanmugam <[email protected]>

drm/amdkfd: Fix error handling for missing PASID in 'kfd_process_device_init_vm'

In the kfd_process_device_init_vm function, a valid error code is now
returned when the associated Process Address Space ID (PASID) is not
present.

If the address space virtual memory (avm) does not have an associated
PASID, the function sets the ret variable to -EINVAL before proceeding
to the error handling section. This ensures that the calling function,
such as kfd_ioctl_acquire_vm, can appropriately handle the error,
thereby preventing any issues during virtual memory initialization.
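
The pattern described above can be sketched in plain user-space C. This is a minimal illustration under stated assumptions, not the driver code: `demo_vm` and `demo_init_vm` are hypothetical names, and a `pasid` of 0 is assumed to mean "no PASID associated".

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a VM object; only the pasid field matters here. */
struct demo_vm {
	unsigned int pasid;	/* assumption: 0 means "no PASID associated" */
};

/* Returns 0 on success or a negative errno value on failure. */
int demo_init_vm(const struct demo_vm *vm, int *initialized)
{
	int ret = 0;

	assert(vm != NULL && initialized != NULL);

	if (vm->pasid == 0) {
		/* Without this assignment the error path would return 0. */
		ret = -EINVAL;
		goto err_get_pasid;
	}

	*initialized = 1;
	return 0;

err_get_pasid:
	*initialized = 0;
	return ret;
}
```

A caller that checks the return value now sees `-EINVAL` for a VM without a PASID instead of a misleading success.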

Fixes the below:
drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_process.c:1694 kfd_process_device_init_vm()
warn: missing error code 'ret'

drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_process.c
1647 int kfd_process_device_init_vm(struct kfd_process_device *pdd,
1648 struct file *drm_file)
1649 {
...
1690
1691 if (unlikely(!avm->pasid)) {
1692 dev_warn(pdd->dev->adev->dev, "WARN: vm %p has no pasid associated",
1693 avm);
--> 1694 goto err_get_pasid;

ret = -EINVAL?

1695 }

Fixes: 8544374c0f82 ("drm/amdkfd: Have kfd driver use same PASID values from graphic driver")
Reported by: Dan Carpenter <[email protected]>
Cc: Xiaogang Chen <[email protected]>
Cc: Felix Kuehling <[email protected]>
Cc: Christian König <[email protected]>
Cc: Alex Deucher <[email protected]>
Signed-off-by: Srinivasan Shanmugam <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>


# 10e08943 12-Feb-2025 Xiaogang Chen <[email protected]>

drm/amdkfd: Fix pasid value leak

Current kfd does not allocate pasid values; instead it uses the pasid value
for each vm from the graphic driver. So kfd should not prevent the graphic
driver from releasing pasid values, since the values are allocated by the
graphic driver, not the kfd driver anymore. With this patch kfd no longer
stops the graphic driver from releasing pasid values.

Fixes: 8544374c0f82 ("drm/amdkfd: Have kfd driver use same PASID values from graphic driver")
Signed-off-by: Xiaogang Chen <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>


Revision tags: v6.14-rc1, v6.13
# 8544374c 13-Jan-2025 Xiaogang Chen <[email protected]>

drm/amdkfd: Have kfd driver use same PASID values from graphic driver

Current kfd driver has its own PASID value for a kfd process and uses it to
locate vm at interrupt handler or mapping between kfd process and vm. That
design is not working when a physical gpu device has multiple spatial
partitions, ex: adev in CPX mode. This patch has kfd driver use same pasid
values that graphic driver generated which is per vm per pasid.

These pasid values are passed to fw/hardware. We do not need change interrupt
handler though more pasid values are used. Also, pasid values at log are
replaced by user process pid; pasid values are not exposed to user. Users see
their process pids that have meaning in user space.

Signed-off-by: Xiaogang Chen <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>


Revision tags: v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3
# a993d319 11-Dec-2024 Zhu Lingshan <[email protected]>

drm/amdkfd: wq_release signals dma_fence only when available

kfd_process_wq_release() signals the eviction fence with
dma_fence_signal(), which warns if the dma_fence
is NULL.

kfd_process->ef is initialized by kfd_process_device_init_vm()
through ioctl. That means the fence is NULL for a newly
created kfd_process, and closing a kfd_process right
after opening it will trigger the warning.

This commit conditionally signals the eviction fence
in kfd_process_wq_release() only when it is available.
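
The conditional signaling described above amounts to a NULL check before the signal call. The following user-space sketch only illustrates that idea; `demo_fence`, `demo_process`, and `demo_release_signal` are hypothetical stand-ins, not kernel types or functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a fence object. */
struct demo_fence {
	bool signaled;
};

/* Hypothetical stand-in for a process whose fence is set up lazily. */
struct demo_process {
	struct demo_fence *ef;	/* NULL until a later setup step creates it */
};

/* Returns true if a fence was signaled, false if there was none to signal. */
bool demo_release_signal(struct demo_process *p)
{
	assert(p != NULL);

	if (!p->ef)
		return false;	/* process was closed before the fence existed */

	p->ef->signaled = true;
	return true;
}
```

A process that is opened and closed immediately takes the early-return branch, so the release path never touches a fence that was not created.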

[ 503.660882] WARNING: CPU: 0 PID: 9 at drivers/dma-buf/dma-fence.c:467 dma_fence_signal+0x74/0xa0
[ 503.782940] Workqueue: kfd_process_wq kfd_process_wq_release [amdgpu]
[ 503.789640] RIP: 0010:dma_fence_signal+0x74/0xa0
[ 503.877620] Call Trace:
[ 503.880066] <TASK>
[ 503.882168] ? __warn+0xcd/0x260
[ 503.885407] ? dma_fence_signal+0x74/0xa0
[ 503.889416] ? report_bug+0x288/0x2d0
[ 503.893089] ? handle_bug+0x53/0xa0
[ 503.896587] ? exc_invalid_op+0x14/0x50
[ 503.900424] ? asm_exc_invalid_op+0x16/0x20
[ 503.904616] ? dma_fence_signal+0x74/0xa0
[ 503.908626] kfd_process_wq_release+0x6b/0x370 [amdgpu]
[ 503.914081] process_one_work+0x654/0x10a0
[ 503.918186] worker_thread+0x6c3/0xe70
[ 503.921943] ? srso_alias_return_thunk+0x5/0xfbef5
[ 503.926735] ? srso_alias_return_thunk+0x5/0xfbef5
[ 503.931527] ? __kthread_parkme+0x82/0x140
[ 503.935631] ? __pfx_worker_thread+0x10/0x10
[ 503.939904] kthread+0x2a8/0x380
[ 503.943132] ? __pfx_kthread+0x10/0x10
[ 503.946882] ret_from_fork+0x2d/0x70
[ 503.950458] ? __pfx_kthread+0x10/0x10
[ 503.954210] ret_from_fork_asm+0x1a/0x30
[ 503.958142] </TASK>
[ 503.960328] ---[ end trace 0000000000000000 ]---

Fixes: 967d226eaae8 ("dma-buf: add WARN_ON() illegal dma-fence signaling")
Signed-off-by: Zhu Lingshan <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
(cherry picked from commit 2774ef7625adb5fb9e9265c26a59dca7b8fd171e)
Cc: [email protected]


# 2774ef76 11-Dec-2024 Zhu Lingshan <[email protected]>

drm/amdkfd: wq_release signals dma_fence only when available

kfd_process_wq_release() signals the eviction fence with
dma_fence_signal(), which warns if the dma_fence
is NULL.

kfd_process->ef is initialized by kfd_process_device_init_vm()
through ioctl. That means the fence is NULL for a newly
created kfd_process, and closing a kfd_process right
after opening it will trigger the warning.

This commit conditionally signals the eviction fence
in kfd_process_wq_release() only when it is available.

[ 503.660882] WARNING: CPU: 0 PID: 9 at drivers/dma-buf/dma-fence.c:467 dma_fence_signal+0x74/0xa0
[ 503.782940] Workqueue: kfd_process_wq kfd_process_wq_release [amdgpu]
[ 503.789640] RIP: 0010:dma_fence_signal+0x74/0xa0
[ 503.877620] Call Trace:
[ 503.880066] <TASK>
[ 503.882168] ? __warn+0xcd/0x260
[ 503.885407] ? dma_fence_signal+0x74/0xa0
[ 503.889416] ? report_bug+0x288/0x2d0
[ 503.893089] ? handle_bug+0x53/0xa0
[ 503.896587] ? exc_invalid_op+0x14/0x50
[ 503.900424] ? asm_exc_invalid_op+0x16/0x20
[ 503.904616] ? dma_fence_signal+0x74/0xa0
[ 503.908626] kfd_process_wq_release+0x6b/0x370 [amdgpu]
[ 503.914081] process_one_work+0x654/0x10a0
[ 503.918186] worker_thread+0x6c3/0xe70
[ 503.921943] ? srso_alias_return_thunk+0x5/0xfbef5
[ 503.926735] ? srso_alias_return_thunk+0x5/0xfbef5
[ 503.931527] ? __kthread_parkme+0x82/0x140
[ 503.935631] ? __pfx_worker_thread+0x10/0x10
[ 503.939904] kthread+0x2a8/0x380
[ 503.943132] ? __pfx_kthread+0x10/0x10
[ 503.946882] ret_from_fork+0x2d/0x70
[ 503.950458] ? __pfx_kthread+0x10/0x10
[ 503.954210] ret_from_fork_asm+0x1a/0x30
[ 503.958142] </TASK>
[ 503.960328] ---[ end trace 0000000000000000 ]---

Fixes: 967d226eaae8 ("dma-buf: add WARN_ON() illegal dma-fence signaling")
Signed-off-by: Zhu Lingshan <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>


Revision tags: v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6
# 71985559 21-Feb-2024 Alex Sierra <[email protected]>

drm/amdkfd: add gc 9.5.0 support on kfd

Initial support for GC 9.5.0.

v2: squash in pqm_clean_queue_resource() fix from Lijo

Signed-off-by: Alex Sierra <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>


# 438b39ac 05-Dec-2024 [email protected] <[email protected]>

drm/amdkfd: pause autosuspend when creating pdd

When using MES creating a pdd will require talking to the GPU to
setup the relevant context. The code here forgot to wake up the GPU
in case it was in suspend, this causes KVM to EFAULT for passthrough
GPU for example. This issue can be masked if the GPU was woken up by
other things (e.g. opening the KMS node) first and have not yet gone to sleep.

v4: do the allocation of proc_ctx_bo in a lazy fashion
when the first queue is created in a process (Felix)

Signed-off-by: Jesse Zhang <[email protected]>
Reviewed-by: Yunxiang Li <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Cc: [email protected]


# 6cb6d437 28-Oct-2024 Xiaogang Chen <[email protected]>

drm/amdkfd: change kfd process kref count at creation

kfd process kref count (process->ref) is initialized to 1 by kref_init. After
it is created there is no need to increase its kref. Instead, take a kfd
process kref at kfd process mmu notifier allocation, since we already decrease
the kref at free_notifier of mmu_notifier_ops, so that the two are paired.

When user process opens kfd node multiple times the kfd process kref is
increased each time to balance with kfd node close operation.

Signed-off-by: Xiaogang Chen <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>


# 21cae8de 06-Nov-2024 Yuan Can <[email protected]>

drm/amdkfd: Fix wrong usage of INIT_WORK()

In kfd_procfs_show(), the sdma_activity_work_handler is a local variable,
so sdma_activity_work_handler.sdma_activity_work should be initialized
with INIT_WORK_ONSTACK() instead of INIT_WORK().

Fixes: 32cb59f31362 ("drm/amdkfd: Track SDMA utilization per process")
Signed-off-by: Yuan Can <[email protected]>
Signed-off-by: Felix Kuehling <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>


# 922f0e00 04-Oct-2024 Srinivasan Shanmugam <[email protected]>

drm/amdkfd: Use dynamic allocation for CU occupancy array in 'kfd_get_cu_occupancy()'

The `kfd_get_cu_occupancy` function previously declared a large
`cu_occupancy` array as a local variable, which could lead to stack
overflows due to excessive stack usage. This commit replaces the static
array allocation with dynamic memory allocation using `kcalloc`,
thereby reducing the stack size.

This change avoids the risk of stack overflows in kernel space, in
scenarios where `AMDGPU_MAX_QUEUES` is large. The allocated memory is
freed using `kfree` before the function returns to prevent memory
leaks.
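
The change from a large on-stack array to a heap allocation can be sketched in user-space C, with `calloc`/`free` standing in for `kcalloc`/`kfree`. `DEMO_MAX_QUEUES`, `demo_cu_occupancy`, and `demo_total_waves` are hypothetical names chosen for this illustration.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_MAX_QUEUES 128	/* hypothetical stand-in for AMDGPU_MAX_QUEUES */

/* Hypothetical per-queue record; the real layout may differ. */
struct demo_cu_occupancy {
	unsigned int wave_cnt;
	unsigned int doorbell_off;
};

/*
 * Sums the wave counts of n queues using a zeroed heap buffer instead of
 * a large local array. Returns -1 on bad input or allocation failure.
 */
long demo_total_waves(const unsigned int *waves, size_t n)
{
	struct demo_cu_occupancy *occ;
	long total = 0;
	size_t i;

	if (n > DEMO_MAX_QUEUES)
		return -1;

	occ = calloc(DEMO_MAX_QUEUES, sizeof(*occ));
	if (!occ)
		return -1;

	for (i = 0; i < n; i++)
		occ[i].wave_cnt = waves[i];
	for (i = 0; i < DEMO_MAX_QUEUES; i++)
		total += occ[i].wave_cnt;

	free(occ);	/* release before returning to avoid a leak */
	return total;
}
```

The buffer lives on the heap for the duration of the call only, so the function's stack frame stays small regardless of the array bound.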

Fixes the below with gcc W=1:
drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_process.c: In function ‘kfd_get_cu_occupancy’:
drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_process.c:322:1: warning: the frame size of 1056 bytes is larger than 1024 bytes [-Wframe-larger-than=]
322 | }
| ^

Fixes: 6ae9e1aba97e ("drm/amdkfd: Update logic for CU occupancy calculations")
Cc: Harish Kasiviswanathan <[email protected]>
Cc: Felix Kuehling <[email protected]>
Cc: Christian König <[email protected]>
Cc: Alex Deucher <[email protected]>
Signed-off-by: Srinivasan Shanmugam <[email protected]>
Suggested-by: Mukul Joshi <[email protected]>
Reviewed-by: Mukul Joshi <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>


# 68d26c10 04-Oct-2024 Philip Yang <[email protected]>

drm/amdkfd: Accounting pdd vram_usage for svm

Process device data pdd->vram_usage is read by rocm-smi via sysfs, this
is currently missing the svm_bo usage accounting, so "rocm-smi
--showpids" per process VRAM usage report is incorrect.

Add pdd->vram_usage accounting on svm_bo allocation and release, and
change it to atomic64_t type because it is now updated outside the
process mutex.

Signed-off-by: Philip Yang <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
(cherry picked from commit 98c0b0efcc11f2a5ddf3ce33af1e48eedf808b04)


# 98c0b0ef 04-Oct-2024 Philip Yang <[email protected]>

drm/amdkfd: Accounting pdd vram_usage for svm

Process device data pdd->vram_usage is read by rocm-smi via sysfs, this
is currently missing the svm_bo usage accounting, so "rocm-smi
--showpids" per process VRAM usage report is incorrect.

Add pdd->vram_usage accounting on svm_bo allocation and release, and
change it to atomic64_t type because it is now updated outside the
process mutex.

Signed-off-by: Philip Yang <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>


# d7d7b947 27-Sep-2024 Lang Yu <[email protected]>

drm/amdkfd: Fix an eviction fence leak

Only creating a new reference for each process instead of each VM.

Fixes: 9a1c1339abf9 ("drm/amdkfd: Run restore_workers on freezable WQs")
Suggested-by: Felix Kuehling <[email protected]>
Signed-off-by: Lang Yu <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
(cherry picked from commit 5fa436289483ae56427b0896c31f72361223c758)
Cc: [email protected]


# 5fa43628 27-Sep-2024 Lang Yu <[email protected]>

drm/amdkfd: Fix an eviction fence leak

Only creating a new reference for each process instead of each VM.

Fixes: 9a1c1339abf9 ("drm/amdkfd: Run restore_workers on freezable WQs")
Suggested-by: Felix Kuehling <[email protected]>
Signed-off-by: Lang Yu <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>


# e45b011d 20-Sep-2024 Mukul Joshi <[email protected]>

drm/amdkfd: Fix CU occupancy for GFX 9.4.3

Make CU occupancy calculations work on GFX 9.4.3 by
updating the logic to handle multiple XCCs correctly.

Signed-off-by: Mukul Joshi <[email protected]>
Reviewed-by: Harish Kasiviswanathan <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>


# 6ae9e1ab 16-Sep-2024 Mukul Joshi <[email protected]>

drm/amdkfd: Update logic for CU occupancy calculations

Currently, the code uses the IH_VMID_X_LUT register to map
a queue's vmid to the corresponding PASID. This logic is racy
since CP can update the VMID-PASID mapping anytime especially
when there are more processes than number of vmids. Update the
logic to calculate CU occupancy by matching doorbell offset of
the queue with valid wave counts against the process's queues.
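
The doorbell-matching approach can be sketched as a lookup of each hardware queue's doorbell offset in the set of doorbells owned by the process. The data layout and all `demo_*` names below are assumptions made for illustration; the driver's actual structures are not shown in this log.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical snapshot of one hardware queue slot. */
struct demo_hw_queue {
	unsigned int doorbell_off;	/* doorbell offset of the queue */
	unsigned int wave_cnt;		/* 0 means no valid wave count */
};

/*
 * Sums wave counts of the hardware queues whose doorbell offset matches
 * one of the process's own queue doorbells.
 */
unsigned int demo_process_waves(const struct demo_hw_queue *hw, size_t n_hw,
				const unsigned int *proc_doorbells,
				size_t n_proc)
{
	unsigned int total = 0;
	size_t i, j;

	for (i = 0; i < n_hw; i++) {
		if (hw[i].wave_cnt == 0)
			continue;	/* nothing to attribute */
		for (j = 0; j < n_proc; j++) {
			if (hw[i].doorbell_off == proc_doorbells[j]) {
				total += hw[i].wave_cnt;
				break;
			}
		}
	}
	return total;
}
```

Matching on doorbell offsets ties each wave count to a queue the process owns, rather than to a VMID-PASID mapping that may change while it is being read.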

Signed-off-by: Mukul Joshi <[email protected]>
Reviewed-by: Harish Kasiviswanathan <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>


# ee0a469c 25-Jun-2024 Jonathan Kim <[email protected]>

drm/amdkfd: support per-queue reset on gfx9

Support per-queue reset for GFX9. The recommendation is for the driver
to target reset the HW queue via a SPI MMIO register write.

Since this requires pipe and HW queue info and MEC FW is limited to
doorbell reports of hung queues after an unmap failure, scan the HW
queue slots defined by SET_RESOURCES first to identify the user queue
candidates to reset.

Only signal reset events to processes that have had a queue reset.

If queue reset fails, fall back to GPU reset.

Signed-off-by: Jonathan Kim <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>


# c86ad391 14-Jul-2024 Philip Yang <[email protected]>

drm/amdkfd: amdkfd_free_gtt_mem clear the correct pointer

Pass a pointer reference to amdgpu_bo_unref so that it clears the correct
pointer. Otherwise amdgpu_bo_unref clears only the local variable, the
original pointer is not set to NULL, and this could cause a use-after-free bug.
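
The difference between passing a pointer by value and passing its address can be shown in a few lines of user-space C. The `demo_bo` type and helpers are hypothetical; only the "unref clears the pointer it is given" convention is taken from the description above.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical reference-counted buffer object. */
struct demo_bo {
	int refcount;
};

struct demo_bo *demo_bo_create(void)
{
	struct demo_bo *bo = calloc(1, sizeof(*bo));

	if (bo)
		bo->refcount = 1;
	return bo;
}

/* Drops one reference and clears the pointer whose address was passed. */
void demo_bo_unref(struct demo_bo **bo)
{
	if (!bo || !*bo)
		return;
	if (--(*bo)->refcount == 0)
		free(*bo);
	*bo = NULL;
}

/* Buggy shape: only the local copy of the pointer is cleared. */
void demo_free_mem_by_value(struct demo_bo *bo)
{
	demo_bo_unref(&bo);
}

/* Fixed shape: the caller's own pointer is cleared. */
void demo_free_mem_by_ref(struct demo_bo **bo)
{
	demo_bo_unref(bo);
}
```

With the by-value variant the caller keeps a stale pointer after the call; with the by-reference variant the caller's pointer becomes NULL, so a later accidental use is easy to detect.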

Signed-off-by: Philip Yang <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Acked-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>


# 62ec7d38 24-Jun-2024 Lijo Lazar <[email protected]>

drm/amdkfd: Use device based logging for errors

Convert some pr_* to some dev_* APIs to identify the device.

Signed-off-by: Lijo Lazar <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>


# 5f571c61 16-Apr-2024 Hawking Zhang <[email protected]>

drm/amdgpu: Add gfx v9_4_4 ip block

Add gfx v9_4_4 ip block support

Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Le Ma <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>


# a89a05e3 10-Apr-2024 Lancelot SIX <[email protected]>

drm/amdkfd: Flush the process wq before creating a kfd_process

There is a race condition when re-creating a kfd_process for a process.
This has been observed when a process under the debugger executes
exec(3). In this scenario:
- The process executes exec.
- This will eventually release the process's mm, which will cause the
  kfd_process object associated with the process to be freed
  (kfd_process_free_notifier decrements the reference count to the
  kfd_process to 0). This causes kfd_process_ref_release to enqueue
  kfd_process_wq_release to the kfd_process_wq.
- The debugger receives the PTRACE_EVENT_EXEC notification, and tries to
  re-enable AMDGPU traps (KFD_IOC_DBG_TRAP_ENABLE).
- When handling this request, KFD tries to re-create a kfd_process.
  This eventually calls kfd_create_process and kobject_init_and_add.

At this point the call to kobject_init_and_add can fail because the
old kfd_process.kobj has not been freed yet by kfd_process_wq_release.

This patch proposes to avoid this race by making sure to drain
kfd_process_wq before creating a new kfd_process object. This way, we
know that any cleanup task is done executing when we reach
kobject_init_and_add.

Signed-off-by: Lancelot SIX <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>


# f5b90533 10-Apr-2024 Lancelot SIX <[email protected]>

drm/amdkfd: Flush the process wq before creating a kfd_process

There is a race condition when re-creating a kfd_process for a process.
This has been observed when a process under the debugger executes
exec(3). In this scenario:
- The process executes exec.
- This will eventually release the process's mm, which will cause the
  kfd_process object associated with the process to be freed
  (kfd_process_free_notifier decrements the reference count to the
  kfd_process to 0). This causes kfd_process_ref_release to enqueue
  kfd_process_wq_release to the kfd_process_wq.
- The debugger receives the PTRACE_EVENT_EXEC notification, and tries to
  re-enable AMDGPU traps (KFD_IOC_DBG_TRAP_ENABLE).
- When handling this request, KFD tries to re-create a kfd_process.
  This eventually calls kfd_create_process and kobject_init_and_add.

At this point the call to kobject_init_and_add can fail because the
old kfd_process.kobj has not been freed yet by kfd_process_wq_release.

This patch proposes to avoid this race by making sure to drain
kfd_process_wq before creating a new kfd_process object. This way, we
know that any cleanup task is done executing when we reach
kobject_init_and_add.

Signed-off-by: Lancelot SIX <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>


# f989eccc 19-Apr-2024 Felix Kuehling <[email protected]>

drm/amdkfd: Fix rescheduling of restore worker

Handle the case that the restore worker was already scheduled by another
eviction while the restore was in progress.

Fixes: 9a1c1339abf9 ("drm/amdkfd: Run restore_workers on freezable WQs")
Signed-off-by: Felix Kuehling <[email protected]>
Reviewed-by: Philip Yang <[email protected]>
Tested-by: Yunxiang Li <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
