|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13 |
|
| #
bd22e44a |
| 15-Jan-2025 |
Christian König <[email protected]> |
drm/amdgpu: rework how isolation is enforced v2
Limiting the number of available VMIDs to enforce isolation causes some issues with gang submit and applying certain HW workarounds which require multiple VMIDs to work correctly.
So instead start to track all submissions to the relevant engines in a per partition data structure and use the dma_fences of the submissions to enforce isolation similar to what a VMID limit does.
v2: use ~0l for jobs without isolation to distinguish them from kernel submissions, which use NULL for the owner. Add a warning when we are OOM.
Signed-off-by: Christian König <[email protected]> Acked-by: Srinivasan Shanmugam <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
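The owner-based approach described in this commit can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the driver code: `toy_partition`, `toy_submit`, `toy_complete` and `TOY_NO_ISOLATION` are hypothetical names, and a plain counter stands in for the dma_fences of earlier submissions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical marker for jobs that do not request isolation. It must differ
 * from NULL, which the commit message reserves for kernel submissions. */
#define TOY_NO_ISOLATION ((void *)~(uintptr_t)0)

/* Toy per-partition tracker; in_flight stands in for the fences. */
struct toy_partition {
	void *owner;        /* owner key of the submissions currently running */
	unsigned in_flight; /* unfinished submissions belonging to that owner */
};

/* Map a job to the key that is compared against the current owner. */
static void *toy_owner_key(void *client, bool enforce_isolation)
{
	return enforce_isolation ? client : TOY_NO_ISOLATION;
}

/* Returns true if the job may run now, false if it has to wait until the
 * work of the previous owner has finished. */
static bool toy_submit(struct toy_partition *p, void *client,
		       bool enforce_isolation)
{
	void *key = toy_owner_key(client, enforce_isolation);

	assert(p != NULL);
	if (p->owner != key && p->in_flight > 0)
		return false;

	p->owner = key;
	p->in_flight++;
	return true;
}

/* Mark one submission of the current owner as finished. */
static void toy_complete(struct toy_partition *p)
{
	assert(p != NULL);
	if (p->in_flight > 0)
		p->in_flight--;
}
```

In this model a second client is held back only while work from a different owner is still in flight, so no cap on the number of VMIDs is needed to keep clients apart.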
|
| #
099f273e |
| 25-Feb-2025 |
André Almeida <[email protected]> |
drm/amdgpu: Trigger a wedged event for ring reset
Instead of triggering a wedged event only for complete GPU resets, also trigger one for ring resets. Regardless of the kind of reset, it's useful for userspace to know that it happened, because the kernel will reject further submissions from that app.
Reviewed-by: Christian König <[email protected]> Signed-off-by: André Almeida <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
| #
9c696cc5 |
| 26-Feb-2025 |
André Almeida <[email protected]> |
drm/amdgpu: Create a debug option to disable ring reset
Prior to the addition of ring reset, the debug option `debug_disable_soft_recovery` could be used to force a full device reset. Now that we have ring reset, create a debug option to disable them in amdgpu, forcing the driver to go with the full device reset path again when both options are combined.
This option is useful for testing and debugging purposes when one wants to test the full reset from userspace.
Signed-off-by: André Almeida <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
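How the two debug options combine can be sketched as a small decision function. This is a hedged illustration only: the struct, enum and field names are hypothetical, and the order of attempts (soft recovery, then ring reset, then full reset) is taken from the commit messages in this history rather than from the driver source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_recovery {
	TOY_SOFT_RECOVERY,
	TOY_RING_RESET,
	TOY_FULL_RESET,
};

/* Hypothetical mirror of the two debug switches. */
struct toy_debug_opts {
	bool disable_soft_recovery;
	bool disable_ring_reset;
};

/* Hypothetical description of which lighter methods would succeed. */
struct toy_hw {
	bool soft_recovery_works;
	bool ring_reset_works;
};

/* Pick the lightest recovery method that is both enabled and workable. */
static enum toy_recovery toy_pick_recovery(const struct toy_debug_opts *opts,
					   const struct toy_hw *hw)
{
	assert(opts != NULL && hw != NULL);

	if (!opts->disable_soft_recovery && hw->soft_recovery_works)
		return TOY_SOFT_RECOVERY;
	if (!opts->disable_ring_reset && hw->ring_reset_works)
		return TOY_RING_RESET;
	return TOY_FULL_RESET;
}
```

With both switches set, the function falls through to the full reset even when the lighter methods would have worked, which is the testing scenario the commit describes.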
|
| #
b7fd6528 |
| 20-Feb-2025 |
André Almeida <[email protected]> |
drm/amdgpu: Log after a successful ring reset
When a ring reset happens, the kernel log shows only "amdgpu: Starting <ring name> ring reset", but when it finishes nothing appears in the log. Explicitly write in the log that the reset has finished correctly.
Reviewed-by: Christian König <[email protected]> Signed-off-by: André Almeida <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
| #
c94943b0 |
| 21-Feb-2025 |
[email protected] <[email protected]> |
drm/amdgpu: Update amdgpu_job_timedout to check if the ring is guilty
This patch updates the `amdgpu_job_timedout` function to check if the ring is actually guilty of causing the timeout. If not, it skips error handling and fence completion.
v2: move the is_guilty check down into the queue reset area (Alex)
v3: need to call is_guilty before reset (Alex)
v4: squash in is_guilty logic fixes (Alex)
Signed-off-by: Alex Deucher <[email protected]> Signed-off-by: Jesse Zhang <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
| #
80b6ef8a |
| 21-Feb-2025 |
Tvrtko Ursulin <[email protected]> |
drm/amdgpu: Pop jobs from the queue more robustly
Replace a copy of DRM scheduler's to_drm_sched_job with a copy of a newly added drm_sched_entity_queue_pop.
This allows breaking the hidden dependency that queue_node has to be the first element in struct drm_sched_job.
A comment is also added with a reference to the mailing list discussion explaining the copied helper will be removed when the whole broken amdgpu_job_stop_all_jobs_on_sched is removed.
Signed-off-by: Tvrtko Ursulin <[email protected]> Cc: Christian König <[email protected]> Cc: Danilo Krummrich <[email protected]> Cc: Matthew Brost <[email protected]> Cc: Philipp Stanner <[email protected]> Cc: Zhang, Hawking <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Philipp Stanner <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
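One way a hidden "must be the first member" dependency can arise is when a container_of-style conversion is applied to a pointer that may be NULL: the result is NULL again only if the member sits at offset 0. The sketch below uses hypothetical `toy_*` names and is not the scheduler's helper; it shows a NULL-safe conversion that keeps working when the node is not the first member.

```c
#include <stddef.h>

struct toy_node {
	struct toy_node *next;
};

struct toy_job {
	int id;                     /* placed first on purpose */
	struct toy_node queue_node; /* deliberately not the first member */
};

/* Userspace stand-in for the kernel's container_of(). */
#define toy_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* NULL-safe conversion: an empty queue (NULL node) yields NULL instead of a
 * bogus pointer that is offset from address zero. */
static struct toy_job *toy_job_from_node(struct toy_node *node)
{
	if (!node)
		return NULL;
	return toy_container_of(node, struct toy_job, queue_node);
}
```

Checking the node before converting it removes the layout assumption, so members of the job structure can be reordered freely.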
|
|
Revision tags: v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2 |
|
| #
a93b1020 |
| 06-Dec-2024 |
Pierre-Eric Pelloux-Prayer <[email protected]> |
drm/amdgpu: don't access invalid sched
Since 2320c9e6a768 ("drm/sched: memset() 'job' in drm_sched_job_init()") accessing job->base.sched can produce unexpected results as the initialisation of (*job)->base.sched done in amdgpu_job_alloc is overwritten by the memset.
This commit fixes an issue when a CS would fail validation and would be rejected after job->num_ibs is incremented. In this case, amdgpu_ib_free(ring->adev, ...) will be called, which would crash the machine because the ring value is bogus.
To fix this, pass a NULL pointer to amdgpu_ib_free(): we can do this because the device is actually not used in this function.
The next commit will remove the ring argument completely.
Fixes: 2320c9e6a768 ("drm/sched: memset() 'job' in drm_sched_job_init()") Signed-off-by: Pierre-Eric Pelloux-Prayer <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]> (cherry picked from commit 2ae520cb12831d264ceb97c61f72c59d33c0dbd7)
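The failure mode described here, an early initialisation being wiped by a later memset, can be reproduced in a few lines. This is a simplified model with hypothetical `toy_*` types and functions; it only mirrors the order of operations named in the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_sched {
	int id;
};

struct toy_base {
	struct toy_sched *sched;
};

struct toy_job {
	struct toy_base base;
	unsigned num_ibs;
};

/* Allocation step: sets base.sched early, like the old initialisation. */
static void toy_job_alloc(struct toy_job *job, struct toy_sched *sched)
{
	assert(job != NULL);
	memset(job, 0, sizeof(*job));
	job->base.sched = sched;
}

/* Init step: clears the embedded base object, which silently discards the
 * pointer stored by toy_job_alloc(). */
static void toy_job_init(struct toy_job *job)
{
	assert(job != NULL);
	memset(&job->base, 0, sizeof(job->base));
}
```

After `toy_job_init()` any code that still dereferences `job->base.sched` on an error path reads a cleared pointer, which is why the fix stops passing a value derived from it.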
|
| #
11815bb0 |
| 12-Dec-2024 |
Christian König <[email protected]> |
drm/amdgpu: partially revert "reduce reset time"
This partially reverts commit 194eb174cbe4fe2b3376ac30acca2dc8c8beca00.
This commit introduced a new state variable into adev without even remotely worrying about CPU barriers.
Since we already have the amdgpu_in_reset() function exactly for this use case partially revert that.
Signed-off-by: Christian König <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
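For readers unfamiliar with the barrier concern, the following userspace sketch shows a state flag that is published and read with explicit ordering, using C11 atomics. It is an illustration under assumptions: the `toy_*` names are hypothetical, and it makes no claim about how amdgpu_in_reset() is implemented.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_dev {
	atomic_int in_reset; /* non-zero while a reset is running */
	int reset_payload;   /* plain data written before the flag is set */
};

/* Writer: store the payload first, then publish the flag with release
 * ordering so the payload is visible to anyone who observes the flag. */
static void toy_reset_begin(struct toy_dev *d, int payload)
{
	assert(d != NULL);
	d->reset_payload = payload;
	atomic_store_explicit(&d->in_reset, 1, memory_order_release);
}

static void toy_reset_end(struct toy_dev *d)
{
	assert(d != NULL);
	atomic_store_explicit(&d->in_reset, 0, memory_order_release);
}

/* Reader: the acquire load pairs with the release stores above. */
static bool toy_in_reset(struct toy_dev *d)
{
	assert(d != NULL);
	return atomic_load_explicit(&d->in_reset, memory_order_acquire) != 0;
}
```

A plain, unordered flag gives no such guarantee across CPUs, which is the objection raised against the reverted state variable.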
|
| #
26c95e83 |
| 12-Dec-2024 |
Christian König <[email protected]> |
drm/amdgpu: set the VM pointer to NULL in amdgpu_job_prepare
As soon as the prepare phase is completed the VM might be released, better set it to NULL.
Signed-off-by: Christian König <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
| #
54a1b36d |
| 06-Dec-2024 |
Pierre-Eric Pelloux-Prayer <[email protected]> |
drm/amdgpu: remove useless init from amdgpu_job_alloc
This init is useless because base.sched will be cleared to 0 in drm_sched_job_init because of commit 2320c9e6a768 ("drm/sched: memset() 'job' in drm_sched_job_init()").
Signed-off-by: Pierre-Eric Pelloux-Prayer <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
| #
0014952b |
| 06-Dec-2024 |
Pierre-Eric Pelloux-Prayer <[email protected]> |
drm/amdgpu: drop the amdgpu_device argument from amdgpu_ib_free
It's unused.
Signed-off-by: Pierre-Eric Pelloux-Prayer <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
| #
2ae520cb |
| 06-Dec-2024 |
Pierre-Eric Pelloux-Prayer <[email protected]> |
drm/amdgpu: don't access invalid sched
Since 2320c9e6a768 ("drm/sched: memset() 'job' in drm_sched_job_init()") accessing job->base.sched can produce unexpected results as the initialisation of (*job)->base.sched done in amdgpu_job_alloc is overwritten by the memset.
This commit fixes an issue when a CS would fail validation and would be rejected after job->num_ibs is incremented. In this case, amdgpu_ib_free(ring->adev, ...) will be called, which would crash the machine because the ring value is bogus.
To fix this, pass a NULL pointer to amdgpu_ib_free(): we can do this because the device is actually not used in this function.
The next commit will remove the ring argument completely.
Fixes: 2320c9e6a768 ("drm/sched: memset() 'job' in drm_sched_job_init()") Signed-off-by: Pierre-Eric Pelloux-Prayer <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4 |
|
| #
35984fd4 |
| 15-Oct-2024 |
Alex Deucher <[email protected]> |
drm/amdgpu: add ring reset messages
Add messages to make it clear when a per ring reset happens. This is helpful for debugging and aligns with other reset methods.
v2: add ring name in success/fail messages (Lijo)
Reviewed-by: Lijo Lazar <[email protected]> Reviewed-by: Kent Russell <[email protected]> (v1) Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.12-rc3, v6.12-rc2, v6.12-rc1 |
|
| #
89cfa73b |
| 24-Sep-2024 |
Tvrtko Ursulin <[email protected]> |
drm/amdgpu: Remove the while loop from amdgpu_job_prepare_job
While loop makes it sound like amdgpu_vmid_grab() potentially needs to be called multiple times to produce a fence, while in reality all code paths either return an error, assign a valid job->vmid or assign a vmid which will be valid once the returned fence signals.
Therefore we can remove the loop to make it clear the call does not need to be repeated.
Reviewed-by: Christian König <[email protected]> Signed-off-by: Tvrtko Ursulin <[email protected]> Cc: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
| #
871f44b4 |
| 24-Sep-2024 |
Tvrtko Ursulin <[email protected]> |
drm/amdgpu: Drop impossible condition from amdgpu_job_prepare_job
Fence has been initialised to NULL so no need to test it.
Reviewed-by: Christian König <[email protected]> Signed-off-by: Tvrtko Ursulin <[email protected]> Cc: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
| #
fa73462d |
| 24-Sep-2024 |
Sunil Khatri <[email protected]> |
drm/amdgpu: update the handle ptr in dump_ip_state
Update the ptr handle to amdgpu_ip_block ptr in all the functions.
Signed-off-by: Sunil Khatri <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
| #
e1d27f7a |
| 19-Sep-2024 |
ZhenGuo Yin <[email protected]> |
drm/amdgpu: skip coredump after job timeout in SRIOV
The host driver triggers a VF FLR before the job timeout, so the GPU's error status has already been cleared. Performing a coredump here is unnecessary.
Signed-off-by: ZhenGuo Yin <[email protected]> Acked-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.11, v6.11-rc7, v6.11-rc6 |
|
| #
b2ef8087 |
| 26-Aug-2024 |
Christian König <[email protected]> |
drm/sched: add optional errno to drm_sched_start()
The current implementation of drm_sched_start uses a hardcoded -ECANCELED to dispose of a job when the parent/hw fence is NULL. This results in drm_sched_job_done being called with -ECANCELED for each job with a NULL parent in the pending list, making it difficult to distinguish between recovery methods, whether a queue reset or a full GPU reset was used.
To improve this, we first try a soft recovery for timeout jobs and use the error code -ENODATA. If soft recovery fails, we proceed with a queue reset, where the error code remains -ENODATA for the job. Finally, for a full GPU reset, we use error codes -ECANCELED or -ETIME. This patch adds an error code parameter to drm_sched_start, allowing us to differentiate between queue reset and GPU reset failures. This enables user mode and test applications to validate the expected correctness of the requested operation. After a successful queue reset, the only way to continue normal operation is to call drm_sched_job_done with the specific error code -ENODATA.
v1: Initial implementation by Jesse utilized amdgpu_device_lock_reset_domain and amdgpu_device_unlock_reset_domain to allow user mode to track the queue reset status and distinguish between queue reset and GPU reset.
v2: Christian suggested using the error codes -ENODATA for queue reset and -ECANCELED or -ETIME for GPU reset, returned to amdgpu_cs_wait_ioctl.
v3: To meet the requirements, we introduce a new function drm_sched_start_ex with an additional parameter to set dma_fence_set_error, allowing us to handle the specific error codes appropriately and dispose of bad jobs with the selected error code depending on whether it was a queue reset or GPU reset.
v4: Alex suggested using a new name, drm_sched_start_with_recovery_error, which more accurately describes the function's purpose. Additionally, it was recommended to add documentation details about the new method.
v5: Fixed declaration of new function drm_sched_start_with_recovery_error. (Alex)
v6 (chk): rebase on upstream changes, cleanup the commit message, drop the new function again and update all callers, apply the errno also to scheduler fences with hw fences
v7 (chk): rebased
Signed-off-by: Jesse Zhang <[email protected]> Signed-off-by: Vitaly Prosyak <[email protected]> Signed-off-by: Christian König <[email protected]> Acked-by: Daniel Vetter <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
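The error-code convention in this commit message lends itself to a small classifier on the consumer side. The sketch below is a hypothetical userspace helper (`toy_classify_fence_error` and the enum are invented for illustration); only the mapping of -ENODATA, -ECANCELED and -ETIME comes from the commit text.

```c
#include <errno.h>

enum toy_reset_kind {
	TOY_NO_ERROR,            /* fence completed normally */
	TOY_SOFT_OR_QUEUE_RESET, /* -ENODATA: soft recovery or queue reset */
	TOY_FULL_GPU_RESET,      /* -ECANCELED or -ETIME: full GPU reset */
	TOY_OTHER_ERROR,         /* anything this convention does not cover */
};

/* Translate a negative errno reported for a job into the recovery method
 * that the commit message associates with it. */
static enum toy_reset_kind toy_classify_fence_error(int err)
{
	switch (err) {
	case 0:
		return TOY_NO_ERROR;
	case -ENODATA:
		return TOY_SOFT_OR_QUEUE_RESET;
	case -ECANCELED:
	case -ETIME:
		return TOY_FULL_GPU_RESET;
	default:
		return TOY_OTHER_ERROR;
	}
}
```

A test application could use such a helper to check that a deliberately hung job was recovered by the method it expected.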
|
| #
30e8f4c2 |
| 28-Aug-2024 |
Sunil Khatri <[email protected]> |
drm/amdgpu: Move the dumping log out of for loop
The log message "Dumping IP State Completed" needs to be logged only once, when state dumping is complete.
Hence move it out of the for loop.
Signed-off-by: Sunil Khatri <[email protected]> Acked-by: Trigger Huang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.11-rc5 |
|
| #
c67db6a6 |
| 19-Aug-2024 |
Trigger Huang <[email protected]> |
drm/amdgpu: Do core dump immediately when job tmo
Do the coredump immediately after a job timeout to get a closer representation of GPU's error status.
V2: This will skip printing vram_lost as the GPU reset has not happened yet (Alex)
V3: Unconditionally call the core dump as we care about all the reset functions (soft recovery, queue reset and full adapter reset) (Alex)
V4: Do the dump after adev->job_hang = true (Sunil)
Signed-off-by: Trigger Huang <[email protected]> Acked-by: Sunil Khatri <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4 |
|
| #
fb0a5834 |
| 12-Jun-2024 |
Prike Liang <[email protected]> |
drm/amdgpu: increase the reset counter for the queue reset
Update the reset counter for the amdgpu_cs_query_reset_state()
Acked-by: Vitaly Prosyak <[email protected]> Signed-off-by: Prike Liang <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.10-rc3 |
|
| #
15789fa0 |
| 03-Jun-2024 |
Alex Deucher <[email protected]> |
drm/amdgpu: add per ring reset support (v5)
If a specific job is hung, try and reset just the ring associated with the job.
v2: move to amdgpu_job.c
v3: fix drm_sched_stop() handling when ring reset fails
v4: drop unnecessary amdgpu_fence_driver_clear_job_fences() and drm_sched_increase_karma()
v5: rework sched_stop handling
Acked-by: Vitaly Prosyak <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8 |
|
| #
829798c7 |
| 07-Mar-2024 |
Joshua Ashton <[email protected]> |
drm/amdgpu: Forward soft recovery errors to userspace
As we discussed before[1], soft recovery should be forwarded to userspace, or we can get into a really bad state where apps will keep submitting hanging command buffers cascading us to a hard reset.
1: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Joshua Ashton <[email protected]> Reviewed-by: Marek Olšák <[email protected]> Signed-off-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]> (cherry picked from commit 434967aadbbbe3ad9103cc29e9a327de20fdba01) Cc: [email protected]
|
| #
434967aa |
| 07-Mar-2024 |
Joshua Ashton <[email protected]> |
drm/amdgpu: Forward soft recovery errors to userspace
As we discussed before[1], soft recovery should be forwarded to userspace, or we can get into a really bad state where apps will keep submitting hanging command buffers cascading us to a hard reset.
1: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Joshua Ashton <[email protected]> Reviewed-by: Marek Olšák <[email protected]> Signed-off-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
| #
7d570f56 |
| 08-Jul-2024 |
Alex Deucher <[email protected]> |
drm/amdgpu/job: Replace DRM_INFO/ERROR logging
Use the dev_info/err variants so we get per device logging in multi-GPU cases.
Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|