History log of /linux-6.15/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c (Results 1 – 25 of 168)
Revision (<<< Hide revision tags) (Show revision tags >>>) Date Author Comments
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13
# bd22e44a 15-Jan-2025 Christian König <[email protected]>

drm/amdgpu: rework how isolation is enforced v2

Limiting the number of available VMIDs to enforce isolation causes some
issues with gang submit and applying certain HW workarounds which
require mult

drm/amdgpu: rework how isolation is enforced v2

Limiting the number of available VMIDs to enforce isolation causes some
issues with gang submit and applying certain HW workarounds which
require multiple VMIDs to work correctly.

So instead start to track all submissions to the relevant engines in a
per partition data structure and use the dma_fences of the submissions
to enforce isolation similar to what a VMID limit does.

v2: use ~0l for jobs without isolation to distinct it from kernel
submissions which uses NULL for the owner. Add some warning when we
are OOM.

Signed-off-by: Christian König <[email protected]>
Acked-by: Srinivasan Shanmugam <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# 099f273e 25-Feb-2025 André Almeida <[email protected]>

drm/amdgpu: Trigger a wedged event for ring reset

Instead of only triggering a wedged event for complete GPU resets,
trigger for ring resets. Regardless of the reset, it's useful for
userspace to kn

drm/amdgpu: Trigger a wedged event for ring reset

Instead of only triggering a wedged event for complete GPU resets,
trigger for ring resets. Regardless of the reset, it's useful for
userspace to know that it happened because the kernel will reject
further submissions from that app.

Reviewed-by: Christian König <[email protected]>
Signed-off-by: André Almeida <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# 9c696cc5 26-Feb-2025 André Almeida <[email protected]>

drm/amdgpu: Create a debug option to disable ring reset

Prior to the addition of ring reset, the debug option
`debug_disable_soft_recovery` could be used to force a full device
reset. Now that we ha

drm/amdgpu: Create a debug option to disable ring reset

Prior to the addition of ring reset, the debug option
`debug_disable_soft_recovery` could be used to force a full device
reset. Now that we have ring reset, create a debug option to disable
them in amdgpu, forcing the driver to go with the full device
reset path again when both options are combined.

This option is useful for testing and debugging purposes when one wants
to test the full reset from userspace.

Signed-off-by: André Almeida <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# b7fd6528 20-Feb-2025 André Almeida <[email protected]>

drm/amdgpu: Log after a successful ring reset

When a ring reset happens, the kernel log shows only "amdgpu: Starting
<ring name> ring reset", but when it finishes nothing appears in the
log. Explici

drm/amdgpu: Log after a successful ring reset

When a ring reset happens, the kernel log shows only "amdgpu: Starting
<ring name> ring reset", but when it finishes nothing appears in the
log. Explicitly write in the log that the reset has finished correctly.

Reviewed-by: Christian König <[email protected]>
Signed-off-by: André Almeida <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# c94943b0 21-Feb-2025 [email protected] <[email protected]>

drm/amdgpu: Update amdgpu_job_timedout to check if the ring is guilty

This patch updates the `amdgpu_job_timedout` function to check if
the ring is actually guilty of causing the timeout. If not, it

drm/amdgpu: Update amdgpu_job_timedout to check if the ring is guilty

This patch updates the `amdgpu_job_timedout` function to check if
the ring is actually guilty of causing the timeout. If not, it
skips error handling and fence completion.

v2: move the is_guilty check down into the queue reset area (Alex)
v3: need to call is_guilty before reset (Alex)
v4: squash in is_guilty logic fixes (Alex)

Signed-off-by: Alex Deucher <[email protected]>
Signed-off-by: Jesse Zhang <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# 80b6ef8a 21-Feb-2025 Tvrtko Ursulin <[email protected]>

drm/amdgpu: Pop jobs from the queue more robustly

Replace a copy of DRM scheduler's to_drm_sched_job with a copy of a newly
added drm_sched_entity_queue_pop.

This allows breaking the hidden depende

drm/amdgpu: Pop jobs from the queue more robustly

Replace a copy of DRM scheduler's to_drm_sched_job with a copy of a newly
added drm_sched_entity_queue_pop.

This allows breaking the hidden dependency that queue_node has to be the
first element in struct drm_sched_job.

A comment is also added with a reference to the mailing list discussion
explaining the copied helper will be removed when the whole broken
amdgpu_job_stop_all_jobs_on_sched is removed.

Signed-off-by: Tvrtko Ursulin <[email protected]>
Cc: Christian König <[email protected]>
Cc: Danilo Krummrich <[email protected]>
Cc: Matthew Brost <[email protected]>
Cc: Philipp Stanner <[email protected]>
Cc: Zhang, Hawking <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Philipp Stanner <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

show more ...


Revision tags: v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2
# a93b1020 06-Dec-2024 Pierre-Eric Pelloux-Prayer <[email protected]>

drm/amdgpu: don't access invalid sched

Since 2320c9e6a768 ("drm/sched: memset() 'job' in drm_sched_job_init()")
accessing job->base.sched can produce unexpected results as the initialisation
of (*jo

drm/amdgpu: don't access invalid sched

Since 2320c9e6a768 ("drm/sched: memset() 'job' in drm_sched_job_init()")
accessing job->base.sched can produce unexpected results as the initialisation
of (*job)->base.sched done in amdgpu_job_alloc is overwritten by the
memset.

This commit fixes an issue when a CS would fail validation and would
be rejected after job->num_ibs is incremented. In this case,
amdgpu_ib_free(ring->adev, ...) will be called, which would crash the
machine because the ring value is bogus.

To fix this, pass a NULL pointer to amdgpu_ib_free(): we can do this
because the device is actually not used in this function.

The next commit will remove the ring argument completely.

Fixes: 2320c9e6a768 ("drm/sched: memset() 'job' in drm_sched_job_init()")
Signed-off-by: Pierre-Eric Pelloux-Prayer <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
(cherry picked from commit 2ae520cb12831d264ceb97c61f72c59d33c0dbd7)

show more ...


# 11815bb0 12-Dec-2024 Christian König <[email protected]>

drm/amdgpu: partially revert "reduce reset time"

This partially reverts commit 194eb174cbe4fe2b3376ac30acca2dc8c8beca00.

This commit introduced a new state variable into adev without even
remotely

drm/amdgpu: partially revert "reduce reset time"

This partially reverts commit 194eb174cbe4fe2b3376ac30acca2dc8c8beca00.

This commit introduced a new state variable into adev without even
remotely worrying about CPU barriers.

Since we already have the amdgpu_in_reset() function exactly for this
use case partially revert that.

Signed-off-by: Christian König <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# 26c95e83 12-Dec-2024 Christian König <[email protected]>

drm/amdgpu: set the VM pointer to NULL in amdgpu_job_prepare

As soon as the prepare phase is completed the VM might be released,
better set it to NULL.

Signed-off-by: Christian König <christian.koe

drm/amdgpu: set the VM pointer to NULL in amdgpu_job_prepare

As soon as the prepare phase is completed the VM might be released,
better set it to NULL.

Signed-off-by: Christian König <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# 54a1b36d 06-Dec-2024 Pierre-Eric Pelloux-Prayer <[email protected]>

drm/amdgpu: remove useless init from amdgpu_job_alloc

This init is useless because base.sched will be cleared to 0 in drm_sched_job_init
because of commit 2320c9e6a768 ("drm/sched: memset() 'job' in

drm/amdgpu: remove useless init from amdgpu_job_alloc

This init is useless because base.sched will be cleared to 0 in drm_sched_job_init
because of commit 2320c9e6a768 ("drm/sched: memset() 'job' in drm_sched_job_init()").

Signed-off-by: Pierre-Eric Pelloux-Prayer <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# 0014952b 06-Dec-2024 Pierre-Eric Pelloux-Prayer <[email protected]>

drm/amdgpu: drop the amdgpu_device argument from amdgpu_ib_free

It's unused.

Signed-off-by: Pierre-Eric Pelloux-Prayer <[email protected]>
Reviewed-by: Alex Deucher <alexander.deuc

drm/amdgpu: drop the amdgpu_device argument from amdgpu_ib_free

It's unused.

Signed-off-by: Pierre-Eric Pelloux-Prayer <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# 2ae520cb 06-Dec-2024 Pierre-Eric Pelloux-Prayer <[email protected]>

drm/amdgpu: don't access invalid sched

Since 2320c9e6a768 ("drm/sched: memset() 'job' in drm_sched_job_init()")
accessing job->base.sched can produce unexpected results as the initialisation
of (*jo

drm/amdgpu: don't access invalid sched

Since 2320c9e6a768 ("drm/sched: memset() 'job' in drm_sched_job_init()")
accessing job->base.sched can produce unexpected results as the initialisation
of (*job)->base.sched done in amdgpu_job_alloc is overwritten by the
memset.

This commit fixes an issue when a CS would fail validation and would
be rejected after job->num_ibs is incremented. In this case,
amdgpu_ib_free(ring->adev, ...) will be called, which would crash the
machine because the ring value is bogus.

To fix this, pass a NULL pointer to amdgpu_ib_free(): we can do this
because the device is actually not used in this function.

The next commit will remove the ring argument completely.

Fixes: 2320c9e6a768 ("drm/sched: memset() 'job' in drm_sched_job_init()")
Signed-off-by: Pierre-Eric Pelloux-Prayer <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4
# 35984fd4 15-Oct-2024 Alex Deucher <[email protected]>

drm/amdgpu: add ring reset messages

Add messages to make it clear when a per ring reset
happens. This is helpful for debugging and aligns with
other reset methods.

v2: add ring name in success/fai

drm/amdgpu: add ring reset messages

Add messages to make it clear when a per ring reset
happens. This is helpful for debugging and aligns with
other reset methods.

v2: add ring name in success/fail messages (Lijo)

Reviewed-by: Lijo Lazar <[email protected]>
Reviewed-by: Kent Russell <[email protected]> (v1)
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.12-rc3, v6.12-rc2, v6.12-rc1
# 89cfa73b 24-Sep-2024 Tvrtko Ursulin <[email protected]>

drm/amdgpu: Remove the while loop from amdgpu_job_prepare_job

While loop makes it sound like amdgpu_vmid_grab() potentially needs to be
called multiple times to produce a fence, while in reality all

drm/amdgpu: Remove the while loop from amdgpu_job_prepare_job

While loop makes it sound like amdgpu_vmid_grab() potentially needs to be
called multiple times to produce a fence, while in reality all code paths
either return an error, assign a valid job->vmid or assign a vmid which
will be valid once the returned fence signals.

Therefore we can remove the loop to make it clear the call does not need
to be repeated.

Reviewed-by: Christian König <[email protected]>
Signed-off-by: Tvrtko Ursulin <[email protected]>
Cc: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# 871f44b4 24-Sep-2024 Tvrtko Ursulin <[email protected]>

drm/amdgpu: Drop impossible condition from amdgpu_job_prepare_job

Fence has been initialised to NULL so no need to test it.

Reviewed-by: Christian König <[email protected]>
Signed-off-by: Tv

drm/amdgpu: Drop impossible condition from amdgpu_job_prepare_job

Fence has been initialised to NULL so no need to test it.

Reviewed-by: Christian König <[email protected]>
Signed-off-by: Tvrtko Ursulin <[email protected]>
Cc: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# fa73462d 24-Sep-2024 Sunil Khatri <[email protected]>

drm/amdgpu: update the handle ptr in dump_ip_state

Update the ptr handle to amdgpu_ip_block ptr in all
the functions.

Signed-off-by: Sunil Khatri <[email protected]>
Reviewed-by: Christian König

drm/amdgpu: update the handle ptr in dump_ip_state

Update the ptr handle to amdgpu_ip_block ptr in all
the functions.

Signed-off-by: Sunil Khatri <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# e1d27f7a 19-Sep-2024 ZhenGuo Yin <[email protected]>

drm/amdgpu: skip coredump after job timeout in SRIOV

VF FLR will be triggered by host driver before job timeout,
hence the error status of GPU get cleared. Performing a
coredump here is unnecessary.

drm/amdgpu: skip coredump after job timeout in SRIOV

VF FLR will be triggered by host driver before job timeout,
hence the error status of GPU get cleared. Performing a
coredump here is unnecessary.

Signed-off-by: ZhenGuo Yin <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.11, v6.11-rc7, v6.11-rc6
# b2ef8087 26-Aug-2024 Christian König <[email protected]>

drm/sched: add optional errno to drm_sched_start()

The current implementation of drm_sched_start uses a hardcoded
-ECANCELED to dispose of a job when the parent/hw fence is NULL.
This results in drm

drm/sched: add optional errno to drm_sched_start()

The current implementation of drm_sched_start uses a hardcoded
-ECANCELED to dispose of a job when the parent/hw fence is NULL.
This results in drm_sched_job_done being called with -ECANCELED for
each job with a NULL parent in the pending list, making it difficult
to distinguish between recovery methods, whether a queue reset or a
full GPU reset was used.

To improve this, we first try a soft recovery for timeout jobs and
use the error code -ENODATA. If soft recovery fails, we proceed with
a queue reset, where the error code remains -ENODATA for the job.
Finally, for a full GPU reset, we use error codes -ECANCELED or
-ETIME. This patch adds an error code parameter to drm_sched_start,
allowing us to differentiate between queue reset and GPU reset
failures. This enables user mode and test applications to validate
the expected correctness of the requested operation. After a
successful queue reset, the only way to continue normal operation is
to call drm_sched_job_done with the specific error code -ENODATA.

v1: Initial implementation by Jesse utilized amdgpu_device_lock_reset_domain
and amdgpu_device_unlock_reset_domain to allow user mode to track
the queue reset status and distinguish between queue reset and
GPU reset.
v2: Christian suggested using the error codes -ENODATA for queue reset
and -ECANCELED or -ETIME for GPU reset, returned to
amdgpu_cs_wait_ioctl.
v3: To meet the requirements, we introduce a new function
drm_sched_start_ex with an additional parameter to set
dma_fence_set_error, allowing us to handle the specific error
codes appropriately and dispose of bad jobs with the selected
error code depending on whether it was a queue reset or GPU reset.
v4: Alex suggested using a new name, drm_sched_start_with_recovery_error,
which more accurately describes the function's purpose.
Additionally, it was recommended to add documentation details
about the new method.
v5: Fixed declaration of new function drm_sched_start_with_recovery_error.(Alex)
v6 (chk): rebase on upstream changes, cleanup the commit message,
drop the new function again and update all callers,
apply the errno also to scheduler fences with hw fences
v7 (chk): rebased

Signed-off-by: Jesse Zhang <[email protected]>
Signed-off-by: Vitaly Prosyak <[email protected]>
Signed-off-by: Christian König <[email protected]>
Acked-by: Daniel Vetter <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

show more ...


# 30e8f4c2 28-Aug-2024 Sunil Khatri <[email protected]>

drm/amdgpu: Move the dumping log out of for loop

log message "Dumping IP State Completed" needs to
be logged only once when state dumping is complete.

Hence moving it out of the for loop.

Signed-o

drm/amdgpu: Move the dumping log out of for loop

log message "Dumping IP State Completed" needs to
be logged only once when state dumping is complete.

Hence moving it out of the for loop.

Signed-off-by: Sunil Khatri <[email protected]>
Acked-by: Trigger Huang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.11-rc5
# c67db6a6 19-Aug-2024 Trigger Huang <[email protected]>

drm/amdgpu: Do core dump immediately when job tmo

Do the coredump immediately after a job timeout to get a closer
representation of GPU's error status.

V2: This will skip printing vram_lost as the

drm/amdgpu: Do core dump immediately when job tmo

Do the coredump immediately after a job timeout to get a closer
representation of GPU's error status.

V2: This will skip printing vram_lost as the GPU reset is not
happened yet (Alex)

V3: Unconditionally call the core dump as we care about all the reset
functions(soft-recovery and queue reset and full adapter reset, Alex)

V4: Do the dump after adev->job_hang = true (Sunil)

Signed-off-by: Trigger Huang <[email protected]>
Acked-by: Sunil Khatri <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4
# fb0a5834 12-Jun-2024 Prike Liang <[email protected]>

drm/amdgpu: increase the reset counter for the queue reset

Update the reset counter for the amdgpu_cs_query_reset_state()

Acked-by: Vitaly Prosyak <[email protected]>
Signed-off-by: Prike Lian

drm/amdgpu: increase the reset counter for the queue reset

Update the reset counter for the amdgpu_cs_query_reset_state()

Acked-by: Vitaly Prosyak <[email protected]>
Signed-off-by: Prike Liang <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.10-rc3
# 15789fa0 03-Jun-2024 Alex Deucher <[email protected]>

drm/amdgpu: add per ring reset support (v5)

If a specific job is hung, try and reset just the
ring associated with the job.

v2: move to amdgpu_job.c
v3: fix drm_sched_stop() handling when ring rese

drm/amdgpu: add per ring reset support (v5)

If a specific job is hung, try and reset just the
ring associated with the job.

v2: move to amdgpu_job.c
v3: fix drm_sched_stop() handling when ring reset fails
v4: drop unnecessary amdgpu_fence_driver_clear_job_fences() and
drm_sched_increase_karma()
v5: rework sched_stop handling

Acked-by: Vitaly Prosyak <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8
# 829798c7 07-Mar-2024 Joshua Ashton <[email protected]>

drm/amdgpu: Forward soft recovery errors to userspace

As we discussed before[1], soft recovery should be
forwarded to userspace, or we can get into a really
bad state where apps will keep submitting

drm/amdgpu: Forward soft recovery errors to userspace

As we discussed before[1], soft recovery should be
forwarded to userspace, or we can get into a really
bad state where apps will keep submitting hanging
command buffers cascading us to a hard reset.

1: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Joshua Ashton <[email protected]>
Reviewed-by: Marek Olšák <[email protected]>
Signed-off-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
(cherry picked from commit 434967aadbbbe3ad9103cc29e9a327de20fdba01)
Cc: [email protected]

show more ...


# 434967aa 07-Mar-2024 Joshua Ashton <[email protected]>

drm/amdgpu: Forward soft recovery errors to userspace

As we discussed before[1], soft recovery should be
forwarded to userspace, or we can get into a really
bad state where apps will keep submitting

drm/amdgpu: Forward soft recovery errors to userspace

As we discussed before[1], soft recovery should be
forwarded to userspace, or we can get into a really
bad state where apps will keep submitting hanging
command buffers cascading us to a hard reset.

1: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Joshua Ashton <[email protected]>
Reviewed-by: Marek Olšák <[email protected]>
Signed-off-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# 7d570f56 08-Jul-2024 Alex Deucher <[email protected]>

drm/amdgpu/job: Replace DRM_INFO/ERROR logging

Use the dev_info/err variants so we get per device logging
in multi-GPU cases.

Reviewed-by: Christian König <[email protected]>
Signed-off-by:

drm/amdgpu/job: Replace DRM_INFO/ERROR logging

Use the dev_info/err variants so we get per device logging
in multi-GPU cases.

Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


1234567