amdgpu_job.c - OpenGrok history log for /linux-6.15/drivers/gpu/drm/amd/amdgpu/amdgpu

Revision (<<< Hide revision tags) (Show revision tags >>>)	Date	Author	Comments
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13
# bd22e44a	15-Jan-2025	Christian König <[email protected]>	drm/amdgpu: rework how isolation is enforced v2 Limiting the number of available VMIDs to enforce isolation causes some issues with gang submit and applying certain HW workarounds which require mult drm/amdgpu: rework how isolation is enforced v2 Limiting the number of available VMIDs to enforce isolation causes some issues with gang submit and applying certain HW workarounds which require multiple VMIDs to work correctly. So instead start to track all submissions to the relevant engines in a per partition data structure and use the dma_fences of the submissions to enforce isolation similar to what a VMID limit does. v2: use ~0l for jobs without isolation to distinct it from kernel submissions which uses NULL for the owner. Add some warning when we are OOM. Signed-off-by: Christian König <[email protected]> Acked-by: Srinivasan Shanmugam <[email protected]> Signed-off-by: Alex Deucher <[email protected]> show more ...
# 099f273e	25-Feb-2025	André Almeida <[email protected]>	drm/amdgpu: Trigger a wedged event for ring reset Instead of only triggering a wedged event for complete GPU resets, trigger for ring resets. Regardless of the reset, it's useful for userspace to kn drm/amdgpu: Trigger a wedged event for ring reset Instead of only triggering a wedged event for complete GPU resets, trigger for ring resets. Regardless of the reset, it's useful for userspace to know that it happened because the kernel will reject further submissions from that app. Reviewed-by: Christian König <[email protected]> Signed-off-by: André Almeida <[email protected]> Signed-off-by: Alex Deucher <[email protected]> show more ...
# 9c696cc5	26-Feb-2025	André Almeida <[email protected]>	drm/amdgpu: Create a debug option to disable ring reset Prior to the addition of ring reset, the debug option `debug_disable_soft_recovery` could be used to force a full device reset. Now that we ha drm/amdgpu: Create a debug option to disable ring reset Prior to the addition of ring reset, the debug option `debug_disable_soft_recovery` could be used to force a full device reset. Now that we have ring reset, create a debug option to disable them in amdgpu, forcing the driver to go with the full device reset path again when both options are combined. This option is useful for testing and debugging purposes when one wants to test the full reset from userspace. Signed-off-by: André Almeida <[email protected]> Signed-off-by: Alex Deucher <[email protected]> show more ...
# b7fd6528	20-Feb-2025	André Almeida <[email protected]>	drm/amdgpu: Log after a successful ring reset When a ring reset happens, the kernel log shows only "amdgpu: Starting <ring name> ring reset", but when it finishes nothing appears in the log. Explici drm/amdgpu: Log after a successful ring reset When a ring reset happens, the kernel log shows only "amdgpu: Starting <ring name> ring reset", but when it finishes nothing appears in the log. Explicitly write in the log that the reset has finished correctly. Reviewed-by: Christian König <[email protected]> Signed-off-by: André Almeida <[email protected]> Signed-off-by: Alex Deucher <[email protected]> show more ...
# c94943b0	21-Feb-2025	[email protected] <[email protected]>	drm/amdgpu: Update amdgpu_job_timedout to check if the ring is guilty This patch updates the `amdgpu_job_timedout` function to check if the ring is actually guilty of causing the timeout. If not, it drm/amdgpu: Update amdgpu_job_timedout to check if the ring is guilty This patch updates the `amdgpu_job_timedout` function to check if the ring is actually guilty of causing the timeout. If not, it skips error handling and fence completion. v2: move the is_guilty check down into the queue reset area (Alex) v3: need to call is_guilty before reset (Alex) v4: squash in is_guilty logic fixes (Alex) Signed-off-by: Alex Deucher <[email protected]> Signed-off-by: Jesse Zhang <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]> show more ...
# 80b6ef8a	21-Feb-2025	Tvrtko Ursulin <[email protected]>	drm/amdgpu: Pop jobs from the queue more robustly Replace a copy of DRM scheduler's to_drm_sched_job with a copy of a newly added drm_sched_entity_queue_pop. This allows breaking the hidden depende drm/amdgpu: Pop jobs from the queue more robustly Replace a copy of DRM scheduler's to_drm_sched_job with a copy of a newly added drm_sched_entity_queue_pop. This allows breaking the hidden dependency that queue_node has to be the first element in struct drm_sched_job. A comment is also added with a reference to the mailing list discussion explaining the copied helper will be removed when the whole broken amdgpu_job_stop_all_jobs_on_sched is removed. Signed-off-by: Tvrtko Ursulin <[email protected]> Cc: Christian König <[email protected]> Cc: Danilo Krummrich <[email protected]> Cc: Matthew Brost <[email protected]> Cc: Philipp Stanner <[email protected]> Cc: Zhang, Hawking <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Philipp Stanner <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] show more ...
Revision tags: v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2
# a93b1020	06-Dec-2024	Pierre-Eric Pelloux-Prayer <[email protected]>	drm/amdgpu: don't access invalid sched Since 2320c9e6a768 ("drm/sched: memset() 'job' in drm_sched_job_init()") accessing job->base.sched can produce unexpected results as the initialisation of (jo drm/amdgpu: don't access invalid sched Since 2320c9e6a768 ("drm/sched: memset() 'job' in drm_sched_job_init()") accessing job->base.sched can produce unexpected results as the initialisation of (job)->base.sched done in amdgpu_job_alloc is overwritten by the memset. This commit fixes an issue when a CS would fail validation and would be rejected after job->num_ibs is incremented. In this case, amdgpu_ib_free(ring->adev, ...) will be called, which would crash the machine because the ring value is bogus. To fix this, pass a NULL pointer to amdgpu_ib_free(): we can do this because the device is actually not used in this function. The next commit will remove the ring argument completely. Fixes: 2320c9e6a768 ("drm/sched: memset() 'job' in drm_sched_job_init()") Signed-off-by: Pierre-Eric Pelloux-Prayer <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]> (cherry picked from commit 2ae520cb12831d264ceb97c61f72c59d33c0dbd7) show more ...
# 11815bb0	12-Dec-2024	Christian König <[email protected]>	drm/amdgpu: partially revert "reduce reset time" This partially reverts commit 194eb174cbe4fe2b3376ac30acca2dc8c8beca00. This commit introduced a new state variable into adev without even remotely drm/amdgpu: partially revert "reduce reset time" This partially reverts commit 194eb174cbe4fe2b3376ac30acca2dc8c8beca00. This commit introduced a new state variable into adev without even remotely worrying about CPU barriers. Since we already have the amdgpu_in_reset() function exactly for this use case partially revert that. Signed-off-by: Christian König <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]> show more ...
# 26c95e83	12-Dec-2024	Christian König <[email protected]>	drm/amdgpu: set the VM pointer to NULL in amdgpu_job_prepare As soon as the prepare phase is completed the VM might be released, better set it to NULL. Signed-off-by: Christian König <christian.koe drm/amdgpu: set the VM pointer to NULL in amdgpu_job_prepare As soon as the prepare phase is completed the VM might be released, better set it to NULL. Signed-off-by: Christian König <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]> show more ...
# 54a1b36d	06-Dec-2024	Pierre-Eric Pelloux-Prayer <[email protected]>	drm/amdgpu: remove useless init from amdgpu_job_alloc This init is useless because base.sched will be cleared to 0 in drm_sched_job_init because of commit 2320c9e6a768 ("drm/sched: memset() 'job' in drm/amdgpu: remove useless init from amdgpu_job_alloc This init is useless because base.sched will be cleared to 0 in drm_sched_job_init because of commit 2320c9e6a768 ("drm/sched: memset() 'job' in drm_sched_job_init()"). Signed-off-by: Pierre-Eric Pelloux-Prayer <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]> show more ...
# 0014952b	06-Dec-2024	Pierre-Eric Pelloux-Prayer <[email protected]>	drm/amdgpu: drop the amdgpu_device argument from amdgpu_ib_free It's unused. Signed-off-by: Pierre-Eric Pelloux-Prayer <[email protected]> Reviewed-by: Alex Deucher <alexander.deuc drm/amdgpu: drop the amdgpu_device argument from amdgpu_ib_free It's unused. Signed-off-by: Pierre-Eric Pelloux-Prayer <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]> show more ...
# 2ae520cb	06-Dec-2024	Pierre-Eric Pelloux-Prayer <[email protected]>	drm/amdgpu: don't access invalid sched Since 2320c9e6a768 ("drm/sched: memset() 'job' in drm_sched_job_init()") accessing job->base.sched can produce unexpected results as the initialisation of (jo drm/amdgpu: don't access invalid sched Since 2320c9e6a768 ("drm/sched: memset() 'job' in drm_sched_job_init()") accessing job->base.sched can produce unexpected results as the initialisation of (job)->base.sched done in amdgpu_job_alloc is overwritten by the memset. This commit fixes an issue when a CS would fail validation and would be rejected after job->num_ibs is incremented. In this case, amdgpu_ib_free(ring->adev, ...) will be called, which would crash the machine because the ring value is bogus. To fix this, pass a NULL pointer to amdgpu_ib_free(): we can do this because the device is actually not used in this function. The next commit will remove the ring argument completely. Fixes: 2320c9e6a768 ("drm/sched: memset() 'job' in drm_sched_job_init()") Signed-off-by: Pierre-Eric Pelloux-Prayer <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]> show more ...
Revision tags: v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4
# 35984fd4	15-Oct-2024	Alex Deucher <[email protected]>	drm/amdgpu: add ring reset messages Add messages to make it clear when a per ring reset happens. This is helpful for debugging and aligns with other reset methods. v2: add ring name in success/fai drm/amdgpu: add ring reset messages Add messages to make it clear when a per ring reset happens. This is helpful for debugging and aligns with other reset methods. v2: add ring name in success/fail messages (Lijo) Reviewed-by: Lijo Lazar <[email protected]> Reviewed-by: Kent Russell <[email protected]> (v1) Signed-off-by: Alex Deucher <[email protected]> show more ...
Revision tags: v6.12-rc3, v6.12-rc2, v6.12-rc1
# 89cfa73b	24-Sep-2024	Tvrtko Ursulin <[email protected]>	drm/amdgpu: Remove the while loop from amdgpu_job_prepare_job While loop makes it sound like amdgpu_vmid_grab() potentially needs to be called multiple times to produce a fence, while in reality all drm/amdgpu: Remove the while loop from amdgpu_job_prepare_job While loop makes it sound like amdgpu_vmid_grab() potentially needs to be called multiple times to produce a fence, while in reality all code paths either return an error, assign a valid job->vmid or assign a vmid which will be valid once the returned fence signals. Therefore we can remove the loop to make it clear the call does not need to be repeated. Reviewed-by: Christian König <[email protected]> Signed-off-by: Tvrtko Ursulin <[email protected]> Cc: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]> show more ...
# 871f44b4	24-Sep-2024	Tvrtko Ursulin <[email protected]>	drm/amdgpu: Drop impossible condition from amdgpu_job_prepare_job Fence has been initialised to NULL so no need to test it. Reviewed-by: Christian König <[email protected]> Signed-off-by: Tv drm/amdgpu: Drop impossible condition from amdgpu_job_prepare_job Fence has been initialised to NULL so no need to test it. Reviewed-by: Christian König <[email protected]> Signed-off-by: Tvrtko Ursulin <[email protected]> Cc: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]> show more ...
# fa73462d	24-Sep-2024	Sunil Khatri <[email protected]>	drm/amdgpu: update the handle ptr in dump_ip_state Update the ptr handle to amdgpu_ip_block ptr in all the functions. Signed-off-by: Sunil Khatri <[email protected]> Reviewed-by: Christian König drm/amdgpu: update the handle ptr in dump_ip_state Update the ptr handle to amdgpu_ip_block ptr in all the functions. Signed-off-by: Sunil Khatri <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]> show more ...
# e1d27f7a	19-Sep-2024	ZhenGuo Yin <[email protected]>	drm/amdgpu: skip coredump after job timeout in SRIOV VF FLR will be triggered by host driver before job timeout, hence the error status of GPU get cleared. Performing a coredump here is unnecessary. drm/amdgpu: skip coredump after job timeout in SRIOV VF FLR will be triggered by host driver before job timeout, hence the error status of GPU get cleared. Performing a coredump here is unnecessary. Signed-off-by: ZhenGuo Yin <[email protected]> Acked-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]> show more ...
Revision tags: v6.11, v6.11-rc7, v6.11-rc6
# b2ef8087	26-Aug-2024	Christian König <[email protected]>	drm/sched: add optional errno to drm_sched_start() The current implementation of drm_sched_start uses a hardcoded -ECANCELED to dispose of a job when the parent/hw fence is NULL. This results in drm drm/sched: add optional errno to drm_sched_start() The current implementation of drm_sched_start uses a hardcoded -ECANCELED to dispose of a job when the parent/hw fence is NULL. This results in drm_sched_job_done being called with -ECANCELED for each job with a NULL parent in the pending list, making it difficult to distinguish between recovery methods, whether a queue reset or a full GPU reset was used. To improve this, we first try a soft recovery for timeout jobs and use the error code -ENODATA. If soft recovery fails, we proceed with a queue reset, where the error code remains -ENODATA for the job. Finally, for a full GPU reset, we use error codes -ECANCELED or -ETIME. This patch adds an error code parameter to drm_sched_start, allowing us to differentiate between queue reset and GPU reset failures. This enables user mode and test applications to validate the expected correctness of the requested operation. After a successful queue reset, the only way to continue normal operation is to call drm_sched_job_done with the specific error code -ENODATA. v1: Initial implementation by Jesse utilized amdgpu_device_lock_reset_domain and amdgpu_device_unlock_reset_domain to allow user mode to track the queue reset status and distinguish between queue reset and GPU reset. v2: Christian suggested using the error codes -ENODATA for queue reset and -ECANCELED or -ETIME for GPU reset, returned to amdgpu_cs_wait_ioctl. v3: To meet the requirements, we introduce a new function drm_sched_start_ex with an additional parameter to set dma_fence_set_error, allowing us to handle the specific error codes appropriately and dispose of bad jobs with the selected error code depending on whether it was a queue reset or GPU reset. v4: Alex suggested using a new name, drm_sched_start_with_recovery_error, which more accurately describes the function's purpose. Additionally, it was recommended to add documentation details about the new method. v5: Fixed declaration of new function drm_sched_start_with_recovery_error.(Alex) v6 (chk): rebase on upstream changes, cleanup the commit message, drop the new function again and update all callers, apply the errno also to scheduler fences with hw fences v7 (chk): rebased Signed-off-by: Jesse Zhang <[email protected]> Signed-off-by: Vitaly Prosyak <[email protected]> Signed-off-by: Christian König <[email protected]> Acked-by: Daniel Vetter <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] show more ...
# 30e8f4c2	28-Aug-2024	Sunil Khatri <[email protected]>	drm/amdgpu: Move the dumping log out of for loop log message "Dumping IP State Completed" needs to be logged only once when state dumping is complete. Hence moving it out of the for loop. Signed-o drm/amdgpu: Move the dumping log out of for loop log message "Dumping IP State Completed" needs to be logged only once when state dumping is complete. Hence moving it out of the for loop. Signed-off-by: Sunil Khatri <[email protected]> Acked-by: Trigger Huang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> show more ...
Revision tags: v6.11-rc5
# c67db6a6	19-Aug-2024	Trigger Huang <[email protected]>	drm/amdgpu: Do core dump immediately when job tmo Do the coredump immediately after a job timeout to get a closer representation of GPU's error status. V2: This will skip printing vram_lost as the drm/amdgpu: Do core dump immediately when job tmo Do the coredump immediately after a job timeout to get a closer representation of GPU's error status. V2: This will skip printing vram_lost as the GPU reset is not happened yet (Alex) V3: Unconditionally call the core dump as we care about all the reset functions(soft-recovery and queue reset and full adapter reset, Alex) V4: Do the dump after adev->job_hang = true (Sunil) Signed-off-by: Trigger Huang <[email protected]> Acked-by: Sunil Khatri <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]> show more ...
Revision tags: v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4
# fb0a5834	12-Jun-2024	Prike Liang <[email protected]>	drm/amdgpu: increase the reset counter for the queue reset Update the reset counter for the amdgpu_cs_query_reset_state() Acked-by: Vitaly Prosyak <[email protected]> Signed-off-by: Prike Lian drm/amdgpu: increase the reset counter for the queue reset Update the reset counter for the amdgpu_cs_query_reset_state() Acked-by: Vitaly Prosyak <[email protected]> Signed-off-by: Prike Liang <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]> show more ...
Revision tags: v6.10-rc3
# 15789fa0	03-Jun-2024	Alex Deucher <[email protected]>	drm/amdgpu: add per ring reset support (v5) If a specific job is hung, try and reset just the ring associated with the job. v2: move to amdgpu_job.c v3: fix drm_sched_stop() handling when ring rese drm/amdgpu: add per ring reset support (v5) If a specific job is hung, try and reset just the ring associated with the job. v2: move to amdgpu_job.c v3: fix drm_sched_stop() handling when ring reset fails v4: drop unnecessary amdgpu_fence_driver_clear_job_fences() and drm_sched_increase_karma() v5: rework sched_stop handling Acked-by: Vitaly Prosyak <[email protected]> Signed-off-by: Alex Deucher <[email protected]> show more ...
Revision tags: v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8
# 829798c7	07-Mar-2024	Joshua Ashton <[email protected]>	drm/amdgpu: Forward soft recovery errors to userspace As we discussed before[1], soft recovery should be forwarded to userspace, or we can get into a really bad state where apps will keep submitting drm/amdgpu: Forward soft recovery errors to userspace As we discussed before[1], soft recovery should be forwarded to userspace, or we can get into a really bad state where apps will keep submitting hanging command buffers cascading us to a hard reset. 1: https://lore.kernel.org/all/[email protected]/ Signed-off-by: Joshua Ashton <[email protected]> Reviewed-by: Marek Olšák <[email protected]> Signed-off-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]> (cherry picked from commit 434967aadbbbe3ad9103cc29e9a327de20fdba01) Cc: [email protected] show more ...
# 434967aa	07-Mar-2024	Joshua Ashton <[email protected]>	drm/amdgpu: Forward soft recovery errors to userspace As we discussed before[1], soft recovery should be forwarded to userspace, or we can get into a really bad state where apps will keep submitting drm/amdgpu: Forward soft recovery errors to userspace As we discussed before[1], soft recovery should be forwarded to userspace, or we can get into a really bad state where apps will keep submitting hanging command buffers cascading us to a hard reset. 1: https://lore.kernel.org/all/[email protected]/ Signed-off-by: Joshua Ashton <[email protected]> Reviewed-by: Marek Olšák <[email protected]> Signed-off-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]> show more ...
# 7d570f56	08-Jul-2024	Alex Deucher <[email protected]>	drm/amdgpu/job: Replace DRM_INFO/ERROR logging Use the dev_info/err variants so we get per device logging in multi-GPU cases. Reviewed-by: Christian König <[email protected]> Signed-off-by: drm/amdgpu/job: Replace DRM_INFO/ERROR logging Use the dev_info/err variants so we get per device logging in multi-GPU cases. Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]> show more ...
12 3 4 5 6 7