History log of /linux-6.15/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c (Results 1 – 25 of 225)
Revision (<<< Hide revision tags) (Show revision tags >>>) Date Author Comments
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3
# 9424a5bf 10-Feb-2025 Jonathan Kim <[email protected]>

drm/amdgpu: simplify xgmi peer info calls

Deprecate KFD XGMI peer info calls in favour of calling directly from
simplified XGMI peer info functions.

Signed-off-by: Jonathan Kim <[email protected]

drm/amdgpu: simplify xgmi peer info calls

Deprecate KFD XGMI peer info calls in favour of calling directly from
simplified XGMI peer info functions.

Signed-off-by: Jonathan Kim <[email protected]>
Reviewed-by: Lijo Lazar <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.14-rc2, v6.14-rc1
# 8b0d068e 30-Jan-2025 Alex Deucher <[email protected]>

drm/amdkfd: add a new flag to manage where VRAM allocations go

On big and small APUs we send KFD VRAM allocations to GTT
since the carve out is either non-existent or relatively
small. However, if

drm/amdkfd: add a new flag to manage where VRAM allocations go

On big and small APUs we send KFD VRAM allocations to GTT
since the carve out is either non-existent or relatively
small. However, if someone sets the carve out size to be
relatively large, we may end up using GTT rather than VRAM.

No change of logic with this patch, but it allows the
driver to determine which logic to use based on the
carve out size in the future.

Reviewed-by: Mario Limonciello <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.13, v6.13-rc7
# 90505894 09-Jan-2025 Kenneth Feng <[email protected]>

drm/amdgpu: disable gfxoff with the compute workload on gfx12

Disable gfxoff with the compute workload on gfx12. This is a
workaround for the opencl test failure.

Signed-off-by: Kenneth Feng <kenne

drm/amdgpu: disable gfxoff with the compute workload on gfx12

Disable gfxoff with the compute workload on gfx12. This is a
workaround for the opencl test failure.

Signed-off-by: Kenneth Feng <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
(cherry picked from commit 2affe2bbc997b3920045c2c434e480c81a5f9707)
Cc: [email protected] # 6.12.x

show more ...


# 2affe2bb 09-Jan-2025 Kenneth Feng <[email protected]>

drm/amdgpu: disable gfxoff with the compute workload on gfx12

Disable gfxoff with the compute workload on gfx12. This is a
workaround for the opencl test failure.

Signed-off-by: Kenneth Feng <kenne

drm/amdgpu: disable gfxoff with the compute workload on gfx12

Disable gfxoff with the compute workload on gfx12. This is a
workaround for the opencl test failure.

Signed-off-by: Kenneth Feng <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3
# 11815bb0 12-Dec-2024 Christian König <[email protected]>

drm/amdgpu: partially revert "reduce reset time"

This partially reverts commit 194eb174cbe4fe2b3376ac30acca2dc8c8beca00.

This commit introduced a new state variable into adev without even
remotely

drm/amdgpu: partially revert "reduce reset time"

This partially reverts commit 194eb174cbe4fe2b3376ac30acca2dc8c8beca00.

This commit introduced a new state variable into adev without even
remotely worrying about CPU barriers.

Since we already have the amdgpu_in_reset() function exactly for this
use case partially revert that.

Signed-off-by: Christian König <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# 357ef5b3 10-Dec-2024 Andrew Martin <[email protected]>

drm/amdgpu: Failed to check various return code

Clean up code to quiet the compiler on us failing to check the return
code.

Signed-off-by: Andrew Martin <[email protected]>
Reviewed-by: Harish

drm/amdgpu: Failed to check various return code

Clean up code to quiet the compiler on us failing to check the return
code.

Signed-off-by: Andrew Martin <[email protected]>
Reviewed-by: Harish Kasiviswanathan <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1
# 80d80511 29-Sep-2024 Boyuan Zhang <[email protected]>

drm/amdgpu: pass ip_block in set_powergating_state

Pass ip_block instead of adev in set_powergating_state callback function.
Modify set_powergating_state ip functions for all correspoding ip blocks.

drm/amdgpu: pass ip_block in set_powergating_state

Pass ip_block instead of adev in set_powergating_state callback function.
Modify set_powergating_state ip functions for all correspoding ip blocks.

v2: fix a ip block index error.

v3: remove type casting

Signed-off-by: Boyuan Zhang <[email protected]>
Suggested-by: Christian König <[email protected]>
Acked-by: Christian König <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Reviewed-by: Sunil Khatri <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# fa317985 05-Nov-2024 Lijo Lazar <[email protected]>

drm/amdgpu: Fix map/unmap queue logic

In current logic, it calls ring_alloc followed by a ring_test. ring_test
in turn will call another ring_alloc. This is illegal usage as a
ring_alloc is expected

drm/amdgpu: Fix map/unmap queue logic

In current logic, it calls ring_alloc followed by a ring_test. ring_test
in turn will call another ring_alloc. This is illegal usage as a
ring_alloc is expected to be closed properly with a ring_commit. Change
to commit the map/unmap queue packet first followed by a ring_test. Add a
comment about the usage of ring_test.

Also, reorder the current pre-condition checks of job hang or kiq ring
scheduler not ready. Without them being met, it is not useful to attempt
ring or memory allocations.

Fixes tag refers to the original patch which introduced this issue which
then got carried over into newer code.

Signed-off-by: Lijo Lazar <[email protected]>
Reviewed-by: Le Ma <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Fixes: 6c10b5cc4eaa ("drm/amdgpu: Remove duplicate code in gfx_v8_0.c")

show more ...


# 8fe7cf58 14-Oct-2024 Alex Deucher <[email protected]>

drm/amdkfd: add an interface to query whether is KFD is active

Add an interface to query whether KFD has any active queues.

v2: fix build issues

Acked-by: Srinivasan Shanmugam <srinivasan.shanmuga

drm/amdkfd: add an interface to query whether is KFD is active

Add an interface to query whether KFD has any active queues.

v2: fix build issues

Acked-by: Srinivasan Shanmugam <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.11
# 3eebfd5e 12-Sep-2024 Feifei Xu <[email protected]>

drm/amdkfd:Add kfd function to config sq perfmon

Expose the interface for kfd to config sq perfmon.

Signed-off-by: Feifei Xu <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
R

drm/amdkfd:Add kfd function to config sq perfmon

Expose the interface for kfd to config sq perfmon.

Signed-off-by: Feifei Xu <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Reviewed-by: Lijo Lazar <[email protected]>
Reviewed-by: James Zhu <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.11-rc7, v6.11-rc6, v6.11-rc5
# b05d6476 19-Aug-2024 Hawking Zhang <[email protected]>

drm/amdgpu: Retire query_utcl2_poison_status callback

Driver switches to interrupt source id to identify
utcl2 poison event. polling interface is not needed.

Signed-off-by: Hawking Zhang <Hawking.Z

drm/amdgpu: Retire query_utcl2_poison_status callback

Driver switches to interrupt source id to identify
utcl2 poison event. polling interface is not needed.

Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.11-rc4, v6.11-rc3, v6.11-rc2
# 234eebe1 29-Jul-2024 Amber Lin <[email protected]>

drm/amdkfd: APIs to stop/start KFD scheduling

Provide amdgpu_amdkfd_stop_sched() for amdgpu to stop KFD scheduling
compute work on HIQ. amdgpu_amdkfd_start_sched() resumes the scheduling.
When amdgp

drm/amdkfd: APIs to stop/start KFD scheduling

Provide amdgpu_amdkfd_stop_sched() for amdgpu to stop KFD scheduling
compute work on HIQ. amdgpu_amdkfd_start_sched() resumes the scheduling.
When amdgpu_amdkfd_stop_sched is called, KFD will unmap queues from
runlist. If users send ioctls to KFD to create queues, they'll be added
but those queues won't be mapped to runlist (so not scheduled) until
amdgpu_amdkfd_start_sched is called.

v2: fix build (Alex)

Signed-off-by: Amber Lin <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.11-rc1, v6.10
# c86ad391 14-Jul-2024 Philip Yang <[email protected]>

drm/amdkfd: amdkfd_free_gtt_mem clear the correct pointer

Pass pointer reference to amdgpu_bo_unref to clear the correct pointer,
otherwise amdgpu_bo_unref clear the local variable, the original poi

drm/amdkfd: amdkfd_free_gtt_mem clear the correct pointer

Pass pointer reference to amdgpu_bo_unref to clear the correct pointer,
otherwise amdgpu_bo_unref clear the local variable, the original pointer
not set to NULL, this could cause use-after-free bug.

Signed-off-by: Philip Yang <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Acked-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3
# dbe2c4c8 03-Jun-2024 Eric Huang <[email protected]>

drm/amdkfd: add reset cause in gpu pre-reset smi event

reset cause is requested by customer as additional
info for gpu reset smi event.

v2: integerate reset sources suggested by Lijo Lazar

Signed-

drm/amdkfd: add reset cause in gpu pre-reset smi event

reset cause is requested by customer as additional
info for gpu reset smi event.

v2: integerate reset sources suggested by Lijo Lazar

Signed-off-by: Eric Huang <[email protected]>
Reviewed-by: Lijo Lazar <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6
# 89773b85 26-Apr-2024 Lang Yu <[email protected]>

drm/amdkfd: Let VRAM allocations go to GTT domain on small APUs

Small APUs(i.e., consumer, embedded products) usually have a small
carveout device memory which can't satisfy most compute workloads
m

drm/amdkfd: Let VRAM allocations go to GTT domain on small APUs

Small APUs(i.e., consumer, embedded products) usually have a small
carveout device memory which can't satisfy most compute workloads
memory allocation requirements.

We can't even run a Basic MNIST Example with a default 512MB carveout.
https://github.com/pytorch/examples/tree/main/mnist. Error Log:

"torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate
84.00 MiB. GPU 0 has a total capacity of 512.00 MiB of which 0 bytes
is free. Of the allocated memory 103.83 MiB is allocated by PyTorch,
and 22.17 MiB is reserved by PyTorch but unallocated"

Though we can change BIOS settings to enlarge carveout size,
which is inflexible and may bring complaint. On the other hand,
the memory resource can't be effectively used between host and device.

The solution is MI300A approach, i.e., let VRAM allocations go to GTT.
Then device and host can flexibly and effectively share memory resource.

v2: Report local_mem_size_private as 0. (Felix)

Signed-off-by: Lang Yu <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# eb853413 26-Apr-2024 Lang Yu <[email protected]>

drm/amdkfd: Let VRAM allocations go to GTT domain on small APUs

Small APUs(i.e., consumer, embedded products) usually have a small
carveout device memory which can't satisfy most compute workloads
m

drm/amdkfd: Let VRAM allocations go to GTT domain on small APUs

Small APUs(i.e., consumer, embedded products) usually have a small
carveout device memory which can't satisfy most compute workloads
memory allocation requirements.

We can't even run a Basic MNIST Example with a default 512MB carveout.
https://github.com/pytorch/examples/tree/main/mnist. Error Log:

"torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate
84.00 MiB. GPU 0 has a total capacity of 512.00 MiB of which 0 bytes
is free. Of the allocated memory 103.83 MiB is allocated by PyTorch,
and 22.17 MiB is reserved by PyTorch but unallocated"

Though we can change BIOS settings to enlarge carveout size,
which is inflexible and may bring complaint. On the other hand,
the memory resource can't be effectively used between host and device.

The solution is MI300A approach, i.e., let VRAM allocations go to GTT.
Then device and host can flexibly and effectively share memory resource.

v2: Report local_mem_size_private as 0. (Felix)

Signed-off-by: Lang Yu <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# bfa579b3 22-Apr-2024 YiPeng Chai <[email protected]>

drm/amdgpu: prepare to handle pasid poison consumption

Prepare to handle pasid poison consumption.

Signed-off-by: YiPeng Chai <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-

drm/amdgpu: prepare to handle pasid poison consumption

Prepare to handle pasid poison consumption.

Signed-off-by: YiPeng Chai <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1
# 2fc46e0b 12-Mar-2024 Tao Zhou <[email protected]>

drm/amdgpu: make reset method configurable for RAS poison

Each RAS block has different requirement for gpu reset in poison
consumption handling.
Add support for mmhub RAS poison consumption handling

drm/amdgpu: make reset method configurable for RAS poison

Each RAS block has different requirement for gpu reset in poison
consumption handling.
Add support for mmhub RAS poison consumption handling.

v2: remove the mmhub poison support for kfd int v10.

Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# d8070c42 11-Mar-2024 Tao Zhou <[email protected]>

drm/amdgpu: support utcl2 RAS poison query for mmhub

Support the query for both gfxhub and mmhub, also replace
xcc_id with hub_inst.

Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking

drm/amdgpu: support utcl2 RAS poison query for mmhub

Support the query for both gfxhub and mmhub, also replace
xcc_id with hub_inst.

Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.8, v6.8-rc7, v6.8-rc6
# 71a8d61e 19-Feb-2024 Tao Zhou <[email protected]>

drm/amdgpu: retire gfx ras query_utcl2_poison_status

Replace it with related interface in gfxhub functions.

v2: replace node id with xcc id.
get node id for query_utcl2_poison_status

Signed-of

drm/amdgpu: retire gfx ras query_utcl2_poison_status

Replace it with related interface in gfxhub functions.

v2: replace node id with xcc id.
get node id for query_utcl2_poison_status

Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# f679fd60 04-Mar-2024 Ahmad Rehman <[email protected]>

drm/amdgpu: Init zone device and drm client after mode-1 reset on reload

In passthrough environment, when amdgpu is reloaded after unload, mode-1
is triggered after initializing the necessary IPs, T

drm/amdgpu: Init zone device and drm client after mode-1 reset on reload

In passthrough environment, when amdgpu is reloaded after unload, mode-1
is triggered after initializing the necessary IPs, That init does not
include KFD, and KFD init waits until the reset is completed. KFD init
is called in the reset handler, but in this case, the zone device and
drm client is not initialized, causing app to create kernel panic.

v2: Removing the init KFD condition from amdgpu_amdkfd_drm_client_create.
As the previous version has the potential of creating DRM client twice.

v3: v2 patch results in SDMA engine hung as DRM open causes VM clear to SDMA
before SDMA init. Adding the condition to in drm client creation, on top of v1,
to guard against drm client creation call multiple times.

Signed-off-by: Ahmad Rehman <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# e1f6746f 22-Feb-2024 Lijo Lazar <[email protected]>

drm/amdkfd: Skip packet submission on fatal error

If fatal error is detected, packet submission won't go through. Return
error in such cases. Also, avoid waiting for fence when fatal error is
detect

drm/amdkfd: Skip packet submission on fatal error

If fatal error is detected, packet submission won't go through. Return
error in such cases. Also, avoid waiting for fence when fatal error is
detected.

Signed-off-by: Lijo Lazar <[email protected]>
Reviewed-by: Asad Kamal <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2
# db2aad03 25-Jan-2024 Le Ma <[email protected]>

drm/amdgpu: move the drm client creation behind drm device registration

This patch is to eliminate interrupt warning below:

"[drm] Fence fallback timer expired on ring sdma0.0".

An early vm pt c

drm/amdgpu: move the drm client creation behind drm device registration

This patch is to eliminate interrupt warning below:

"[drm] Fence fallback timer expired on ring sdma0.0".

An early vm pt clearing job is sent to SDMA ahead of interrupt enabled.
And re-locating the drm client creation following after drm_dev_register
looks like a more proper flow.

v2: wrap the drm client creation

Fixes: 1819200166ce ("drm/amdkfd: Export DMABufs from KFD using GEM handles")
Signed-off-by: Le Ma <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Reviewed-by: Lijo Lazar <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# c0125b84 25-Jan-2024 Le Ma <[email protected]>

drm/amdgpu: move the drm client creation behind drm device registration

This patch is to eliminate interrupt warning below:

"[drm] Fence fallback timer expired on ring sdma0.0".

An early vm pt c

drm/amdgpu: move the drm client creation behind drm device registration

This patch is to eliminate interrupt warning below:

"[drm] Fence fallback timer expired on ring sdma0.0".

An early vm pt clearing job is sent to SDMA ahead of interrupt enabled.
And re-locating the drm client creation following after drm_dev_register
looks like a more proper flow.

v2: wrap the drm client creation

Fixes: 1819200166ce ("drm/amdkfd: Export DMABufs from KFD using GEM handles")
Signed-off-by: Le Ma <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Reviewed-by: Lijo Lazar <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# ed1e1e42 23-Jan-2024 YiPeng Chai <[email protected]>

drm/amdgpu: Support passing poison consumption ras block to SRIOV

Support passing poison consumption ras blocks
to SRIOV.

Signed-off-by: YiPeng Chai <[email protected]>
Reviewed-by: Hawking Zhang

drm/amdgpu: Support passing poison consumption ras block to SRIOV

Support passing poison consumption ras blocks
to SRIOV.

Signed-off-by: YiPeng Chai <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


123456789