|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3 |
|
| #
9424a5bf |
| 10-Feb-2025 |
Jonathan Kim <[email protected]> |
drm/amdgpu: simplify xgmi peer info calls
Deprecate KFD XGMI peer info calls in favour of calling directly from simplified XGMI peer info functions.
Signed-off-by: Jonathan Kim <[email protected]
drm/amdgpu: simplify xgmi peer info calls
Deprecate KFD XGMI peer info calls in favour of calling directly from simplified XGMI peer info functions.
Signed-off-by: Jonathan Kim <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v6.14-rc2, v6.14-rc1 |
|
| #
8b0d068e |
| 30-Jan-2025 |
Alex Deucher <[email protected]> |
drm/amdkfd: add a new flag to manage where VRAM allocations go
On big and small APUs we send KFD VRAM allocations to GTT since the carve out is either non-existent or relatively small. However, if
drm/amdkfd: add a new flag to manage where VRAM allocations go
On big and small APUs we send KFD VRAM allocations to GTT since the carve out is either non-existent or relatively small. However, if someone sets the carve out size to be relatively large, we may end up using GTT rather than VRAM.
No change of logic with this patch, but it allows the driver to determine which logic to use based on the carve out size in the future.
Reviewed-by: Mario Limonciello <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v6.13, v6.13-rc7 |
|
| #
90505894 |
| 09-Jan-2025 |
Kenneth Feng <[email protected]> |
drm/amdgpu: disable gfxoff with the compute workload on gfx12
Disable gfxoff with the compute workload on gfx12. This is a workaround for the opencl test failure.
Signed-off-by: Kenneth Feng <kenne
drm/amdgpu: disable gfxoff with the compute workload on gfx12
Disable gfxoff with the compute workload on gfx12. This is a workaround for the opencl test failure.
Signed-off-by: Kenneth Feng <[email protected]> Acked-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]> (cherry picked from commit 2affe2bbc997b3920045c2c434e480c81a5f9707) Cc: [email protected] # 6.12.x
show more ...
|
| #
2affe2bb |
| 09-Jan-2025 |
Kenneth Feng <[email protected]> |
drm/amdgpu: disable gfxoff with the compute workload on gfx12
Disable gfxoff with the compute workload on gfx12. This is a workaround for the opencl test failure.
Signed-off-by: Kenneth Feng <kenne
drm/amdgpu: disable gfxoff with the compute workload on gfx12
Disable gfxoff with the compute workload on gfx12. This is a workaround for the opencl test failure.
Signed-off-by: Kenneth Feng <[email protected]> Acked-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3 |
|
| #
11815bb0 |
| 12-Dec-2024 |
Christian König <[email protected]> |
drm/amdgpu: partially revert "reduce reset time"
This partially reverts commit 194eb174cbe4fe2b3376ac30acca2dc8c8beca00.
This commit introduced a new state variable into adev without even remotely
drm/amdgpu: partially revert "reduce reset time"
This partially reverts commit 194eb174cbe4fe2b3376ac30acca2dc8c8beca00.
This commit introduced a new state variable into adev without even remotely worrying about CPU barriers.
Since we already have the amdgpu_in_reset() function exactly for this use case partially revert that.
Signed-off-by: Christian König <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
357ef5b3 |
| 10-Dec-2024 |
Andrew Martin <[email protected]> |
drm/amdgpu: Failed to check various return code
Clean up code to quiet the compiler on us failing to check the return code.
Signed-off-by: Andrew Martin <[email protected]> Reviewed-by: Harish
drm/amdgpu: Failed to check various return code
Clean up code to quiet the compiler on us failing to check the return code.
Signed-off-by: Andrew Martin <[email protected]> Reviewed-by: Harish Kasiviswanathan <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1 |
|
| #
80d80511 |
| 29-Sep-2024 |
Boyuan Zhang <[email protected]> |
drm/amdgpu: pass ip_block in set_powergating_state
Pass ip_block instead of adev in set_powergating_state callback function. Modify set_powergating_state ip functions for all correspoding ip blocks.
drm/amdgpu: pass ip_block in set_powergating_state
Pass ip_block instead of adev in set_powergating_state callback function. Modify set_powergating_state ip functions for all correspoding ip blocks.
v2: fix a ip block index error.
v3: remove type casting
Signed-off-by: Boyuan Zhang <[email protected]> Suggested-by: Christian König <[email protected]> Acked-by: Christian König <[email protected]> Acked-by: Alex Deucher <[email protected]> Reviewed-by: Sunil Khatri <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
fa317985 |
| 05-Nov-2024 |
Lijo Lazar <[email protected]> |
drm/amdgpu: Fix map/unmap queue logic
In current logic, it calls ring_alloc followed by a ring_test. ring_test in turn will call another ring_alloc. This is illegal usage as a ring_alloc is expected
drm/amdgpu: Fix map/unmap queue logic
In current logic, it calls ring_alloc followed by a ring_test. ring_test in turn will call another ring_alloc. This is illegal usage as a ring_alloc is expected to be closed properly with a ring_commit. Change to commit the map/unmap queue packet first followed by a ring_test. Add a comment about the usage of ring_test.
Also, reorder the current pre-condition checks of job hang or kiq ring scheduler not ready. Without them being met, it is not useful to attempt ring or memory allocations.
Fixes tag refers to the original patch which introduced this issue which then got carried over into newer code.
Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Le Ma <[email protected]> Signed-off-by: Alex Deucher <[email protected]> Fixes: 6c10b5cc4eaa ("drm/amdgpu: Remove duplicate code in gfx_v8_0.c")
show more ...
|
| #
8fe7cf58 |
| 14-Oct-2024 |
Alex Deucher <[email protected]> |
drm/amdkfd: add an interface to query whether is KFD is active
Add an interface to query whether KFD has any active queues.
v2: fix build issues
Acked-by: Srinivasan Shanmugam <srinivasan.shanmuga
drm/amdkfd: add an interface to query whether is KFD is active
Add an interface to query whether KFD has any active queues.
v2: fix build issues
Acked-by: Srinivasan Shanmugam <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v6.11 |
|
| #
3eebfd5e |
| 12-Sep-2024 |
Feifei Xu <[email protected]> |
drm/amdkfd:Add kfd function to config sq perfmon
Expose the interface for kfd to config sq perfmon.
Signed-off-by: Feifei Xu <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> R
drm/amdkfd:Add kfd function to config sq perfmon
Expose the interface for kfd to config sq perfmon.
Signed-off-by: Feifei Xu <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Reviewed-by: James Zhu <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v6.11-rc7, v6.11-rc6, v6.11-rc5 |
|
| #
b05d6476 |
| 19-Aug-2024 |
Hawking Zhang <[email protected]> |
drm/amdgpu: Retire query_utcl2_poison_status callback
Driver switches to interrupt source id to identify utcl2 poison event. polling interface is not needed.
Signed-off-by: Hawking Zhang <Hawking.Z
drm/amdgpu: Retire query_utcl2_poison_status callback
Driver switches to interrupt source id to identify utcl2 poison event. polling interface is not needed.
Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v6.11-rc4, v6.11-rc3, v6.11-rc2 |
|
| #
234eebe1 |
| 29-Jul-2024 |
Amber Lin <[email protected]> |
drm/amdkfd: APIs to stop/start KFD scheduling
Provide amdgpu_amdkfd_stop_sched() for amdgpu to stop KFD scheduling compute work on HIQ. amdgpu_amdkfd_start_sched() resumes the scheduling. When amdgp
drm/amdkfd: APIs to stop/start KFD scheduling
Provide amdgpu_amdkfd_stop_sched() for amdgpu to stop KFD scheduling compute work on HIQ. amdgpu_amdkfd_start_sched() resumes the scheduling. When amdgpu_amdkfd_stop_sched is called, KFD will unmap queues from runlist. If users send ioctls to KFD to create queues, they'll be added but those queues won't be mapped to runlist (so not scheduled) until amdgpu_amdkfd_start_sched is called.
v2: fix build (Alex)
Signed-off-by: Amber Lin <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v6.11-rc1, v6.10 |
|
| #
c86ad391 |
| 14-Jul-2024 |
Philip Yang <[email protected]> |
drm/amdkfd: amdkfd_free_gtt_mem clear the correct pointer
Pass pointer reference to amdgpu_bo_unref to clear the correct pointer, otherwise amdgpu_bo_unref clear the local variable, the original poi
drm/amdkfd: amdkfd_free_gtt_mem clear the correct pointer
Pass pointer reference to amdgpu_bo_unref to clear the correct pointer, otherwise amdgpu_bo_unref clear the local variable, the original pointer not set to NULL, this could cause use-after-free bug.
Signed-off-by: Philip Yang <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Acked-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3 |
|
| #
dbe2c4c8 |
| 03-Jun-2024 |
Eric Huang <[email protected]> |
drm/amdkfd: add reset cause in gpu pre-reset smi event
reset cause is requested by customer as additional info for gpu reset smi event.
v2: integerate reset sources suggested by Lijo Lazar
Signed-
drm/amdkfd: add reset cause in gpu pre-reset smi event
reset cause is requested by customer as additional info for gpu reset smi event.
v2: integerate reset sources suggested by Lijo Lazar
Signed-off-by: Eric Huang <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6 |
|
| #
89773b85 |
| 26-Apr-2024 |
Lang Yu <[email protected]> |
drm/amdkfd: Let VRAM allocations go to GTT domain on small APUs
Small APUs(i.e., consumer, embedded products) usually have a small carveout device memory which can't satisfy most compute workloads m
drm/amdkfd: Let VRAM allocations go to GTT domain on small APUs
Small APUs(i.e., consumer, embedded products) usually have a small carveout device memory which can't satisfy most compute workloads memory allocation requirements.
We can't even run a Basic MNIST Example with a default 512MB carveout. https://github.com/pytorch/examples/tree/main/mnist. Error Log:
"torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 84.00 MiB. GPU 0 has a total capacity of 512.00 MiB of which 0 bytes is free. Of the allocated memory 103.83 MiB is allocated by PyTorch, and 22.17 MiB is reserved by PyTorch but unallocated"
Though we can change BIOS settings to enlarge carveout size, which is inflexible and may bring complaint. On the other hand, the memory resource can't be effectively used between host and device.
The solution is MI300A approach, i.e., let VRAM allocations go to GTT. Then device and host can flexibly and effectively share memory resource.
v2: Report local_mem_size_private as 0. (Felix)
Signed-off-by: Lang Yu <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
eb853413 |
| 26-Apr-2024 |
Lang Yu <[email protected]> |
drm/amdkfd: Let VRAM allocations go to GTT domain on small APUs
Small APUs(i.e., consumer, embedded products) usually have a small carveout device memory which can't satisfy most compute workloads m
drm/amdkfd: Let VRAM allocations go to GTT domain on small APUs
Small APUs(i.e., consumer, embedded products) usually have a small carveout device memory which can't satisfy most compute workloads memory allocation requirements.
We can't even run a Basic MNIST Example with a default 512MB carveout. https://github.com/pytorch/examples/tree/main/mnist. Error Log:
"torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 84.00 MiB. GPU 0 has a total capacity of 512.00 MiB of which 0 bytes is free. Of the allocated memory 103.83 MiB is allocated by PyTorch, and 22.17 MiB is reserved by PyTorch but unallocated"
Though we can change BIOS settings to enlarge carveout size, which is inflexible and may bring complaint. On the other hand, the memory resource can't be effectively used between host and device.
The solution is MI300A approach, i.e., let VRAM allocations go to GTT. Then device and host can flexibly and effectively share memory resource.
v2: Report local_mem_size_private as 0. (Felix)
Signed-off-by: Lang Yu <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
bfa579b3 |
| 22-Apr-2024 |
YiPeng Chai <[email protected]> |
drm/amdgpu: prepare to handle pasid poison consumption
Prepare to handle pasid poison consumption.
Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-
drm/amdgpu: prepare to handle pasid poison consumption
Prepare to handle pasid poison consumption.
Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1 |
|
| #
2fc46e0b |
| 12-Mar-2024 |
Tao Zhou <[email protected]> |
drm/amdgpu: make reset method configurable for RAS poison
Each RAS block has different requirement for gpu reset in poison consumption handling. Add support for mmhub RAS poison consumption handling
drm/amdgpu: make reset method configurable for RAS poison
Each RAS block has different requirement for gpu reset in poison consumption handling. Add support for mmhub RAS poison consumption handling.
v2: remove the mmhub poison support for kfd int v10.
Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
d8070c42 |
| 11-Mar-2024 |
Tao Zhou <[email protected]> |
drm/amdgpu: support utcl2 RAS poison query for mmhub
Support the query for both gfxhub and mmhub, also replace xcc_id with hub_inst.
Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking
drm/amdgpu: support utcl2 RAS poison query for mmhub
Support the query for both gfxhub and mmhub, also replace xcc_id with hub_inst.
Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v6.8, v6.8-rc7, v6.8-rc6 |
|
| #
71a8d61e |
| 19-Feb-2024 |
Tao Zhou <[email protected]> |
drm/amdgpu: retire gfx ras query_utcl2_poison_status
Replace it with related interface in gfxhub functions.
v2: replace node id with xcc id. get node id for query_utcl2_poison_status
Signed-of
drm/amdgpu: retire gfx ras query_utcl2_poison_status
Replace it with related interface in gfxhub functions.
v2: replace node id with xcc id. get node id for query_utcl2_poison_status
Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
f679fd60 |
| 04-Mar-2024 |
Ahmad Rehman <[email protected]> |
drm/amdgpu: Init zone device and drm client after mode-1 reset on reload
In passthrough environment, when amdgpu is reloaded after unload, mode-1 is triggered after initializing the necessary IPs, T
drm/amdgpu: Init zone device and drm client after mode-1 reset on reload
In passthrough environment, when amdgpu is reloaded after unload, mode-1 is triggered after initializing the necessary IPs, That init does not include KFD, and KFD init waits until the reset is completed. KFD init is called in the reset handler, but in this case, the zone device and drm client is not initialized, causing app to create kernel panic.
v2: Removing the init KFD condition from amdgpu_amdkfd_drm_client_create. As the previous version has the potential of creating DRM client twice.
v3: v2 patch results in SDMA engine hung as DRM open causes VM clear to SDMA before SDMA init. Adding the condition to in drm client creation, on top of v1, to guard against drm client creation call multiple times.
Signed-off-by: Ahmad Rehman <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
e1f6746f |
| 22-Feb-2024 |
Lijo Lazar <[email protected]> |
drm/amdkfd: Skip packet submission on fatal error
If fatal error is detected, packet submission won't go through. Return error in such cases. Also, avoid waiting for fence when fatal error is detect
drm/amdkfd: Skip packet submission on fatal error
If fatal error is detected, packet submission won't go through. Return error in such cases. Also, avoid waiting for fence when fatal error is detected.
Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Asad Kamal <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2 |
|
| #
db2aad03 |
| 25-Jan-2024 |
Le Ma <[email protected]> |
drm/amdgpu: move the drm client creation behind drm device registration
This patch is to eliminate interrupt warning below:
"[drm] Fence fallback timer expired on ring sdma0.0".
An early vm pt c
drm/amdgpu: move the drm client creation behind drm device registration
This patch is to eliminate interrupt warning below:
"[drm] Fence fallback timer expired on ring sdma0.0".
An early vm pt clearing job is sent to SDMA ahead of interrupt enabled. And re-locating the drm client creation following after drm_dev_register looks like a more proper flow.
v2: wrap the drm client creation
Fixes: 1819200166ce ("drm/amdkfd: Export DMABufs from KFD using GEM handles") Signed-off-by: Le Ma <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
c0125b84 |
| 25-Jan-2024 |
Le Ma <[email protected]> |
drm/amdgpu: move the drm client creation behind drm device registration
This patch is to eliminate interrupt warning below:
"[drm] Fence fallback timer expired on ring sdma0.0".
An early vm pt c
drm/amdgpu: move the drm client creation behind drm device registration
This patch is to eliminate interrupt warning below:
"[drm] Fence fallback timer expired on ring sdma0.0".
An early vm pt clearing job is sent to SDMA ahead of interrupt enabled. And re-locating the drm client creation following after drm_dev_register looks like a more proper flow.
v2: wrap the drm client creation
Fixes: 1819200166ce ("drm/amdkfd: Export DMABufs from KFD using GEM handles") Signed-off-by: Le Ma <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
ed1e1e42 |
| 23-Jan-2024 |
YiPeng Chai <[email protected]> |
drm/amdgpu: Support passing poison consumption ras block to SRIOV
Support passing poison consumption ras blocks to SRIOV.
Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang
drm/amdgpu: Support passing poison consumption ras block to SRIOV
Support passing poison consumption ras blocks to SRIOV.
Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|