|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1 |
|
| #
447fab30 |
| 28-Mar-2025 |
Christian König <[email protected]> |
drm/amdgpu: use a dummy owner for sysfs triggered cleaner shaders v4
Otherwise triggering sysfs multiple times without other submissions in between only runs the shader once.
v2: add some comment v
drm/amdgpu: use a dummy owner for sysfs triggered cleaner shaders v4
Otherwise triggering sysfs multiple times without other submissions in between only runs the shader once.
v2: add some comment v3: re-add missing cast v4: squash in semicolon fix
Signed-off-by: Christian König <[email protected]> Reviewed-by: Srinivasan Shanmugam <[email protected]> Signed-off-by: Alex Deucher <[email protected]> (cherry picked from commit 8b2ae7d492675e8af8902f103364bef59382b935)
show more ...
|
|
Revision tags: v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1 |
|
| #
db1e58ec |
| 27-Jan-2025 |
Christian König <[email protected]> |
drm/amdgpu: stop reserving VMIDs to enforce isolation
That was quite troublesome for gang submit. Completely drop this approach and enforce the isolation separately.
Signed-off-by: Christian König
drm/amdgpu: stop reserving VMIDs to enforce isolation
That was quite troublesome for gang submit. Completely drop this approach and enforce the isolation separately.
Signed-off-by: Christian König <[email protected]> Acked-by: Srinivasan Shanmugam <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
9e34d8d1 |
| 14-Mar-2025 |
Alex Deucher <[email protected]> |
drm/amdgpu/gfx: adjust workload profile handling
No need to make the workload profile setup dependent on the results of cancelling the delayed work thread. We have all of the necessary checking in p
drm/amdgpu/gfx: adjust workload profile handling
No need to make the workload profile setup dependent on the results of cancelling the delayed work thread. We have all of the necessary checking in place for the workload profile reference counting, so separate the two. As it is now, we can theoretically end up with the call from begin_use happening while the worker thread is executing which would result in the profile not getting set for that submission. It should not affect the reference counting.
v2: bail early if the the profile is already active (Lijo)
Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
553673a3 |
| 12-Mar-2025 |
Alex Deucher <[email protected]> |
drm/amdgpu/gfx: fix ref counting for ring based profile handling
We need to make sure the workload profile ref counts are balanced. This isn't currently the case because we can increment the count
drm/amdgpu/gfx: fix ref counting for ring based profile handling
We need to make sure the workload profile ref counts are balanced. This isn't currently the case because we can increment the count on submissions, but the decrement may be delayed as work comes in. Track when we enable the workload profile so the references are balanced.
v2: switch to a mutex and active flag v3: fix mutex init
Fixes: 8fdb3958e396 ("drm/amdgpu/gfx: add ring helpers for setting workload profile") Cc: Yang Wang <[email protected]> Cc: Kenneth Feng <[email protected]> Tested-by: Kenneth Feng <[email protected]> Reviewed-by: Kenneth Feng <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
0ee560d7 |
| 10-Mar-2025 |
Dan Carpenter <[email protected]> |
drm/amdgpu/gfx: delete stray tabs
These lines are indented one tab too far. Delete the extra tabs.
Reviewed-by: Srinivasan Shanmugam <[email protected]> Signed-off-by: Dan Carpenter <da
drm/amdgpu/gfx: delete stray tabs
These lines are indented one tab too far. Delete the extra tabs.
Reviewed-by: Srinivasan Shanmugam <[email protected]> Signed-off-by: Dan Carpenter <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
748a1f51 |
| 14-Feb-2025 |
Alex Deucher <[email protected]> |
drm/amdgpu/mes: keep enforce isolation up to date
Re-send the mes message on resume to make sure the mes state is up to date.
Fixes: 8521e3c5f058 ("drm/amd/amdgpu: limit single process inside MES")
drm/amdgpu/mes: keep enforce isolation up to date
Re-send the mes message on resume to make sure the mes state is up to date.
Fixes: 8521e3c5f058 ("drm/amd/amdgpu: limit single process inside MES") Acked-by: Srinivasan Shanmugam <[email protected]> Signed-off-by: Alex Deucher <[email protected]> Cc: Shaoyun Liu <[email protected]> Cc: Srinivasan Shanmugam <[email protected]> Signed-off-by: Alex Deucher <[email protected]> (cherry picked from commit 27b791514789844e80da990c456c2465325e0851)
show more ...
|
| #
e7ea8820 |
| 13-Feb-2025 |
Alex Deucher <[email protected]> |
drm/amdgpu/gfx: only call mes for enforce isolation if supported
This should not be called on chips without MES so check if MES is enabled and if the cleaner shader is supported.
Fixes: 8521e3c5f05
drm/amdgpu/gfx: only call mes for enforce isolation if supported
This should not be called on chips without MES so check if MES is enabled and if the cleaner shader is supported.
Fixes: 8521e3c5f058 ("drm/amd/amdgpu: limit single process inside MES") Reviewed-by: Srinivasan Shanmugam <[email protected]> Signed-off-by: Alex Deucher <[email protected]> Cc: Shaoyun Liu <[email protected]> Cc: Srinivasan Shanmugam <[email protected]> (cherry picked from commit 80513e389765c8f9543b26d8fa4bbdf0e59ff8bc)
show more ...
|
| #
27b79151 |
| 14-Feb-2025 |
Alex Deucher <[email protected]> |
drm/amdgpu/mes: keep enforce isolation up to date
Re-send the mes message on resume to make sure the mes state is up to date.
Fixes: 8521e3c5f058 ("drm/amd/amdgpu: limit single process inside MES")
drm/amdgpu/mes: keep enforce isolation up to date
Re-send the mes message on resume to make sure the mes state is up to date.
Fixes: 8521e3c5f058 ("drm/amd/amdgpu: limit single process inside MES") Acked-by: Srinivasan Shanmugam <[email protected]> Signed-off-by: Alex Deucher <[email protected]> Cc: Shaoyun Liu <[email protected]> Cc: Srinivasan Shanmugam <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
80513e38 |
| 13-Feb-2025 |
Alex Deucher <[email protected]> |
drm/amdgpu/gfx: only call mes for enforce isolation if supported
This should not be called on chips without MES so check if MES is enabled and if the cleaner shader is supported.
Fixes: 8521e3c5f05
drm/amdgpu/gfx: only call mes for enforce isolation if supported
This should not be called on chips without MES so check if MES is enabled and if the cleaner shader is supported.
Fixes: 8521e3c5f058 ("drm/amd/amdgpu: limit single process inside MES") Reviewed-by: Srinivasan Shanmugam <[email protected]> Signed-off-by: Alex Deucher <[email protected]> Cc: Shaoyun Liu <[email protected]> Cc: Srinivasan Shanmugam <[email protected]>
show more ...
|
| #
250d9769 |
| 31-Jan-2025 |
Alex Deucher <[email protected]> |
drm/amdgpu/gfx: add amdgpu_gfx_off_ctrl_immediate()
Same as amdgpu_gfx_off_ctrl(), but without the delay for gfxoff disallow.
Reviewed-by: Lijo Lazar <[email protected]> Suggested-by: Błażej Szczy
drm/amdgpu/gfx: add amdgpu_gfx_off_ctrl_immediate()
Same as amdgpu_gfx_off_ctrl(), but without the delay for gfxoff disallow.
Reviewed-by: Lijo Lazar <[email protected]> Suggested-by: Błażej Szczygieł <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v6.13, v6.13-rc7 |
|
| #
8fdb3958 |
| 08-Jan-2025 |
Alex Deucher <[email protected]> |
drm/amdgpu/gfx: add ring helpers for setting workload profile
Add helpers to switch the workload profile dynamically when commands are submitted. This allows us to switch to the FULLSCREEN3D or COM
drm/amdgpu/gfx: add ring helpers for setting workload profile
Add helpers to switch the workload profile dynamically when commands are submitted. This allows us to switch to the FULLSCREEN3D or COMPUTE profile when work is submitted. Add a delayed work handler to delay switching out of the selected profile if additional work comes in. This works the same as the VIDEO profile for VCN. This lets dynamically enable workload profiles on the fly and then move back to the default when there is no work.
Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
1e8c193f |
| 09-Jan-2025 |
Srinivasan Shanmugam <[email protected]> |
drm/amdgpu: Fix Circular Locking Dependency in AMDGPU GFX Isolation
This commit addresses a circular locking dependency issue within the GFX isolation mechanism. The problem was identified by a warn
drm/amdgpu: Fix Circular Locking Dependency in AMDGPU GFX Isolation
This commit addresses a circular locking dependency issue within the GFX isolation mechanism. The problem was identified by a warning indicating a potential deadlock due to inconsistent lock acquisition order.
- The `amdgpu_gfx_enforce_isolation_ring_begin_use` and `amdgpu_gfx_enforce_isolation_ring_end_use` functions previously acquired `enforce_isolation_mutex` and called `amdgpu_gfx_kfd_sch_ctrl`, leading to potential deadlocks. ie., If `amdgpu_gfx_kfd_sch_ctrl` is called while `enforce_isolation_mutex` is held, and `amdgpu_gfx_enforce_isolation_handler` is called while `kfd_sch_mutex` is held, it can create a circular dependency.
By ensuring consistent lock usage, this fix resolves the issue:
[ 606.297333] ====================================================== [ 606.297343] WARNING: possible circular locking dependency detected [ 606.297353] 6.10.0-amd-mlkd-610-311224-lof #19 Tainted: G OE [ 606.297365] ------------------------------------------------------ [ 606.297375] kworker/u96:3/3825 is trying to acquire lock: [ 606.297385] ffff9aa64e431cb8 ((work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work)){+.+.}-{0:0}, at: __flush_work+0x232/0x610 [ 606.297413] but task is already holding lock: [ 606.297423] ffff9aa64e432338 (&adev->gfx.kfd_sch_mutex){+.+.}-{3:3}, at: amdgpu_gfx_kfd_sch_ctrl+0x51/0x4d0 [amdgpu] [ 606.297725] which lock already depends on the new lock.
[ 606.297738] the existing dependency chain (in reverse order) is: [ 606.297749] -> #2 (&adev->gfx.kfd_sch_mutex){+.+.}-{3:3}: [ 606.297765] __mutex_lock+0x85/0x930 [ 606.297776] mutex_lock_nested+0x1b/0x30 [ 606.297786] amdgpu_gfx_kfd_sch_ctrl+0x51/0x4d0 [amdgpu] [ 606.298007] amdgpu_gfx_enforce_isolation_ring_begin_use+0x2a4/0x5d0 [amdgpu] [ 606.298225] amdgpu_ring_alloc+0x48/0x70 [amdgpu] [ 606.298412] amdgpu_ib_schedule+0x176/0x8a0 [amdgpu] [ 606.298603] amdgpu_job_run+0xac/0x1e0 [amdgpu] [ 606.298866] drm_sched_run_job_work+0x24f/0x430 [gpu_sched] [ 606.298880] process_one_work+0x21e/0x680 [ 606.298890] worker_thread+0x190/0x350 [ 606.298899] kthread+0xe7/0x120 [ 606.298908] ret_from_fork+0x3c/0x60 [ 606.298919] ret_from_fork_asm+0x1a/0x30 [ 606.298929] -> #1 (&adev->enforce_isolation_mutex){+.+.}-{3:3}: [ 606.298947] __mutex_lock+0x85/0x930 [ 606.298956] mutex_lock_nested+0x1b/0x30 [ 606.298966] amdgpu_gfx_enforce_isolation_handler+0x87/0x370 [amdgpu] [ 606.299190] process_one_work+0x21e/0x680 [ 606.299199] worker_thread+0x190/0x350 [ 606.299208] kthread+0xe7/0x120 [ 606.299217] ret_from_fork+0x3c/0x60 [ 606.299227] ret_from_fork_asm+0x1a/0x30 [ 606.299236] -> #0 ((work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work)){+.+.}-{0:0}: [ 606.299257] __lock_acquire+0x16f9/0x2810 [ 606.299267] lock_acquire+0xd1/0x300 [ 606.299276] __flush_work+0x250/0x610 [ 606.299286] cancel_delayed_work_sync+0x71/0x80 [ 606.299296] amdgpu_gfx_kfd_sch_ctrl+0x287/0x4d0 [amdgpu] [ 606.299509] amdgpu_gfx_enforce_isolation_ring_begin_use+0x2a4/0x5d0 [amdgpu] [ 606.299723] amdgpu_ring_alloc+0x48/0x70 [amdgpu] [ 606.299909] amdgpu_ib_schedule+0x176/0x8a0 [amdgpu] [ 606.300101] amdgpu_job_run+0xac/0x1e0 [amdgpu] [ 606.300355] drm_sched_run_job_work+0x24f/0x430 [gpu_sched] [ 606.300369] process_one_work+0x21e/0x680 [ 606.300378] worker_thread+0x190/0x350 [ 606.300387] kthread+0xe7/0x120 [ 606.300396] ret_from_fork+0x3c/0x60 [ 606.300406] ret_from_fork_asm+0x1a/0x30 [ 606.300416] other info that might help us debug this:
[ 606.300428] Chain exists of: (work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work) --> &adev->enforce_isolation_mutex --> &adev->gfx.kfd_sch_mutex
[ 606.300458] Possible unsafe locking scenario:
[ 606.300468] CPU0 CPU1 [ 606.300476] ---- ---- [ 606.300484] lock(&adev->gfx.kfd_sch_mutex); [ 606.300494] lock(&adev->enforce_isolation_mutex); [ 606.300508] lock(&adev->gfx.kfd_sch_mutex); [ 606.300521] lock((work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work)); [ 606.300536] *** DEADLOCK ***
[ 606.300546] 5 locks held by kworker/u96:3/3825: [ 606.300555] #0: ffff9aa5aa1f5d58 ((wq_completion)comp_1.1.0){+.+.}-{0:0}, at: process_one_work+0x3f5/0x680 [ 606.300577] #1: ffffaa53c3c97e40 ((work_completion)(&sched->work_run_job)){+.+.}-{0:0}, at: process_one_work+0x1d6/0x680 [ 606.300600] #2: ffff9aa64e463c98 (&adev->enforce_isolation_mutex){+.+.}-{3:3}, at: amdgpu_gfx_enforce_isolation_ring_begin_use+0x1c3/0x5d0 [amdgpu] [ 606.300837] #3: ffff9aa64e432338 (&adev->gfx.kfd_sch_mutex){+.+.}-{3:3}, at: amdgpu_gfx_kfd_sch_ctrl+0x51/0x4d0 [amdgpu] [ 606.301062] #4: ffffffff8c1a5660 (rcu_read_lock){....}-{1:2}, at: __flush_work+0x70/0x610 [ 606.301083] stack backtrace: [ 606.301092] CPU: 14 PID: 3825 Comm: kworker/u96:3 Tainted: G OE 6.10.0-amd-mlkd-610-311224-lof #19 [ 606.301109] Hardware name: Gigabyte Technology Co., Ltd. X570S GAMING X/X570S GAMING X, BIOS F7 03/22/2024 [ 606.301124] Workqueue: comp_1.1.0 drm_sched_run_job_work [gpu_sched] [ 606.301140] Call Trace: [ 606.301146] <TASK> [ 606.301154] dump_stack_lvl+0x9b/0xf0 [ 606.301166] dump_stack+0x10/0x20 [ 606.301175] print_circular_bug+0x26c/0x340 [ 606.301187] check_noncircular+0x157/0x170 [ 606.301197] ? register_lock_class+0x48/0x490 [ 606.301213] __lock_acquire+0x16f9/0x2810 [ 606.301230] lock_acquire+0xd1/0x300 [ 606.301239] ? __flush_work+0x232/0x610 [ 606.301250] ? srso_alias_return_thunk+0x5/0xfbef5 [ 606.301261] ? mark_held_locks+0x54/0x90 [ 606.301274] ? __flush_work+0x232/0x610 [ 606.301284] __flush_work+0x250/0x610 [ 606.301293] ? __flush_work+0x232/0x610 [ 606.301305] ? __pfx_wq_barrier_func+0x10/0x10 [ 606.301318] ? mark_held_locks+0x54/0x90 [ 606.301331] ? srso_alias_return_thunk+0x5/0xfbef5 [ 606.301345] cancel_delayed_work_sync+0x71/0x80 [ 606.301356] amdgpu_gfx_kfd_sch_ctrl+0x287/0x4d0 [amdgpu] [ 606.301661] amdgpu_gfx_enforce_isolation_ring_begin_use+0x2a4/0x5d0 [amdgpu] [ 606.302050] ? srso_alias_return_thunk+0x5/0xfbef5 [ 606.302069] amdgpu_ring_alloc+0x48/0x70 [amdgpu] [ 606.302452] amdgpu_ib_schedule+0x176/0x8a0 [amdgpu] [ 606.302862] ? drm_sched_entity_error+0x82/0x190 [gpu_sched] [ 606.302890] amdgpu_job_run+0xac/0x1e0 [amdgpu] [ 606.303366] drm_sched_run_job_work+0x24f/0x430 [gpu_sched] [ 606.303388] process_one_work+0x21e/0x680 [ 606.303409] worker_thread+0x190/0x350 [ 606.303424] ? __pfx_worker_thread+0x10/0x10 [ 606.303437] kthread+0xe7/0x120 [ 606.303449] ? __pfx_kthread+0x10/0x10 [ 606.303463] ret_from_fork+0x3c/0x60 [ 606.303476] ? __pfx_kthread+0x10/0x10 [ 606.303489] ret_from_fork_asm+0x1a/0x30 [ 606.303512] </TASK>
v2: Refactor lock handling to resolve circular dependency (Alex)
- Introduced a `sched_work` flag to defer the call to `amdgpu_gfx_kfd_sch_ctrl` until after releasing `enforce_isolation_mutex`. - This change ensures that `amdgpu_gfx_kfd_sch_ctrl` is called outside the critical section, preventing the circular dependency and deadlock. - The `sched_work` flag is set within the mutex-protected section if conditions are met, and the actual function call is made afterward. - This approach ensures consistent lock acquisition order.
Fixes: afefd6f24502 ("drm/amdgpu: Implement Enforce Isolation Handler for KGD/KFD serialization") Cc: Christian König <[email protected]> Cc: Alex Deucher <[email protected]> Signed-off-by: Srinivasan Shanmugam <[email protected]> Suggested-by: Alex Deucher <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]> (cherry picked from commit 0b6b2dd38336d5fd49214f0e4e6495e658e3ab44) Cc: [email protected]
show more ...
|
| #
0b6b2dd3 |
| 09-Jan-2025 |
Srinivasan Shanmugam <[email protected]> |
drm/amdgpu: Fix Circular Locking Dependency in AMDGPU GFX Isolation
This commit addresses a circular locking dependency issue within the GFX isolation mechanism. The problem was identified by a warn
drm/amdgpu: Fix Circular Locking Dependency in AMDGPU GFX Isolation
This commit addresses a circular locking dependency issue within the GFX isolation mechanism. The problem was identified by a warning indicating a potential deadlock due to inconsistent lock acquisition order.
- The `amdgpu_gfx_enforce_isolation_ring_begin_use` and `amdgpu_gfx_enforce_isolation_ring_end_use` functions previously acquired `enforce_isolation_mutex` and called `amdgpu_gfx_kfd_sch_ctrl`, leading to potential deadlocks. ie., If `amdgpu_gfx_kfd_sch_ctrl` is called while `enforce_isolation_mutex` is held, and `amdgpu_gfx_enforce_isolation_handler` is called while `kfd_sch_mutex` is held, it can create a circular dependency.
By ensuring consistent lock usage, this fix resolves the issue:
[ 606.297333] ====================================================== [ 606.297343] WARNING: possible circular locking dependency detected [ 606.297353] 6.10.0-amd-mlkd-610-311224-lof #19 Tainted: G OE [ 606.297365] ------------------------------------------------------ [ 606.297375] kworker/u96:3/3825 is trying to acquire lock: [ 606.297385] ffff9aa64e431cb8 ((work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work)){+.+.}-{0:0}, at: __flush_work+0x232/0x610 [ 606.297413] but task is already holding lock: [ 606.297423] ffff9aa64e432338 (&adev->gfx.kfd_sch_mutex){+.+.}-{3:3}, at: amdgpu_gfx_kfd_sch_ctrl+0x51/0x4d0 [amdgpu] [ 606.297725] which lock already depends on the new lock.
[ 606.297738] the existing dependency chain (in reverse order) is: [ 606.297749] -> #2 (&adev->gfx.kfd_sch_mutex){+.+.}-{3:3}: [ 606.297765] __mutex_lock+0x85/0x930 [ 606.297776] mutex_lock_nested+0x1b/0x30 [ 606.297786] amdgpu_gfx_kfd_sch_ctrl+0x51/0x4d0 [amdgpu] [ 606.298007] amdgpu_gfx_enforce_isolation_ring_begin_use+0x2a4/0x5d0 [amdgpu] [ 606.298225] amdgpu_ring_alloc+0x48/0x70 [amdgpu] [ 606.298412] amdgpu_ib_schedule+0x176/0x8a0 [amdgpu] [ 606.298603] amdgpu_job_run+0xac/0x1e0 [amdgpu] [ 606.298866] drm_sched_run_job_work+0x24f/0x430 [gpu_sched] [ 606.298880] process_one_work+0x21e/0x680 [ 606.298890] worker_thread+0x190/0x350 [ 606.298899] kthread+0xe7/0x120 [ 606.298908] ret_from_fork+0x3c/0x60 [ 606.298919] ret_from_fork_asm+0x1a/0x30 [ 606.298929] -> #1 (&adev->enforce_isolation_mutex){+.+.}-{3:3}: [ 606.298947] __mutex_lock+0x85/0x930 [ 606.298956] mutex_lock_nested+0x1b/0x30 [ 606.298966] amdgpu_gfx_enforce_isolation_handler+0x87/0x370 [amdgpu] [ 606.299190] process_one_work+0x21e/0x680 [ 606.299199] worker_thread+0x190/0x350 [ 606.299208] kthread+0xe7/0x120 [ 606.299217] ret_from_fork+0x3c/0x60 [ 606.299227] ret_from_fork_asm+0x1a/0x30 [ 606.299236] -> #0 ((work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work)){+.+.}-{0:0}: [ 606.299257] __lock_acquire+0x16f9/0x2810 [ 606.299267] lock_acquire+0xd1/0x300 [ 606.299276] __flush_work+0x250/0x610 [ 606.299286] cancel_delayed_work_sync+0x71/0x80 [ 606.299296] amdgpu_gfx_kfd_sch_ctrl+0x287/0x4d0 [amdgpu] [ 606.299509] amdgpu_gfx_enforce_isolation_ring_begin_use+0x2a4/0x5d0 [amdgpu] [ 606.299723] amdgpu_ring_alloc+0x48/0x70 [amdgpu] [ 606.299909] amdgpu_ib_schedule+0x176/0x8a0 [amdgpu] [ 606.300101] amdgpu_job_run+0xac/0x1e0 [amdgpu] [ 606.300355] drm_sched_run_job_work+0x24f/0x430 [gpu_sched] [ 606.300369] process_one_work+0x21e/0x680 [ 606.300378] worker_thread+0x190/0x350 [ 606.300387] kthread+0xe7/0x120 [ 606.300396] ret_from_fork+0x3c/0x60 [ 606.300406] ret_from_fork_asm+0x1a/0x30 [ 606.300416] other info that might help us debug this:
[ 606.300428] Chain exists of: (work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work) --> &adev->enforce_isolation_mutex --> &adev->gfx.kfd_sch_mutex
[ 606.300458] Possible unsafe locking scenario:
[ 606.300468] CPU0 CPU1 [ 606.300476] ---- ---- [ 606.300484] lock(&adev->gfx.kfd_sch_mutex); [ 606.300494] lock(&adev->enforce_isolation_mutex); [ 606.300508] lock(&adev->gfx.kfd_sch_mutex); [ 606.300521] lock((work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work)); [ 606.300536] *** DEADLOCK ***
[ 606.300546] 5 locks held by kworker/u96:3/3825: [ 606.300555] #0: ffff9aa5aa1f5d58 ((wq_completion)comp_1.1.0){+.+.}-{0:0}, at: process_one_work+0x3f5/0x680 [ 606.300577] #1: ffffaa53c3c97e40 ((work_completion)(&sched->work_run_job)){+.+.}-{0:0}, at: process_one_work+0x1d6/0x680 [ 606.300600] #2: ffff9aa64e463c98 (&adev->enforce_isolation_mutex){+.+.}-{3:3}, at: amdgpu_gfx_enforce_isolation_ring_begin_use+0x1c3/0x5d0 [amdgpu] [ 606.300837] #3: ffff9aa64e432338 (&adev->gfx.kfd_sch_mutex){+.+.}-{3:3}, at: amdgpu_gfx_kfd_sch_ctrl+0x51/0x4d0 [amdgpu] [ 606.301062] #4: ffffffff8c1a5660 (rcu_read_lock){....}-{1:2}, at: __flush_work+0x70/0x610 [ 606.301083] stack backtrace: [ 606.301092] CPU: 14 PID: 3825 Comm: kworker/u96:3 Tainted: G OE 6.10.0-amd-mlkd-610-311224-lof #19 [ 606.301109] Hardware name: Gigabyte Technology Co., Ltd. X570S GAMING X/X570S GAMING X, BIOS F7 03/22/2024 [ 606.301124] Workqueue: comp_1.1.0 drm_sched_run_job_work [gpu_sched] [ 606.301140] Call Trace: [ 606.301146] <TASK> [ 606.301154] dump_stack_lvl+0x9b/0xf0 [ 606.301166] dump_stack+0x10/0x20 [ 606.301175] print_circular_bug+0x26c/0x340 [ 606.301187] check_noncircular+0x157/0x170 [ 606.301197] ? register_lock_class+0x48/0x490 [ 606.301213] __lock_acquire+0x16f9/0x2810 [ 606.301230] lock_acquire+0xd1/0x300 [ 606.301239] ? __flush_work+0x232/0x610 [ 606.301250] ? srso_alias_return_thunk+0x5/0xfbef5 [ 606.301261] ? mark_held_locks+0x54/0x90 [ 606.301274] ? __flush_work+0x232/0x610 [ 606.301284] __flush_work+0x250/0x610 [ 606.301293] ? __flush_work+0x232/0x610 [ 606.301305] ? __pfx_wq_barrier_func+0x10/0x10 [ 606.301318] ? mark_held_locks+0x54/0x90 [ 606.301331] ? srso_alias_return_thunk+0x5/0xfbef5 [ 606.301345] cancel_delayed_work_sync+0x71/0x80 [ 606.301356] amdgpu_gfx_kfd_sch_ctrl+0x287/0x4d0 [amdgpu] [ 606.301661] amdgpu_gfx_enforce_isolation_ring_begin_use+0x2a4/0x5d0 [amdgpu] [ 606.302050] ? srso_alias_return_thunk+0x5/0xfbef5 [ 606.302069] amdgpu_ring_alloc+0x48/0x70 [amdgpu] [ 606.302452] amdgpu_ib_schedule+0x176/0x8a0 [amdgpu] [ 606.302862] ? drm_sched_entity_error+0x82/0x190 [gpu_sched] [ 606.302890] amdgpu_job_run+0xac/0x1e0 [amdgpu] [ 606.303366] drm_sched_run_job_work+0x24f/0x430 [gpu_sched] [ 606.303388] process_one_work+0x21e/0x680 [ 606.303409] worker_thread+0x190/0x350 [ 606.303424] ? __pfx_worker_thread+0x10/0x10 [ 606.303437] kthread+0xe7/0x120 [ 606.303449] ? __pfx_kthread+0x10/0x10 [ 606.303463] ret_from_fork+0x3c/0x60 [ 606.303476] ? __pfx_kthread+0x10/0x10 [ 606.303489] ret_from_fork_asm+0x1a/0x30 [ 606.303512] </TASK>
v2: Refactor lock handling to resolve circular dependency (Alex)
- Introduced a `sched_work` flag to defer the call to `amdgpu_gfx_kfd_sch_ctrl` until after releasing `enforce_isolation_mutex`. - This change ensures that `amdgpu_gfx_kfd_sch_ctrl` is called outside the critical section, preventing the circular dependency and deadlock. - The `sched_work` flag is set within the mutex-protected section if conditions are met, and the actual function call is made afterward. - This approach ensures consistent lock acquisition order.
Fixes: afefd6f24502 ("drm/amdgpu: Implement Enforce Isolation Handler for KGD/KFD serialization") Cc: Christian König <[email protected]> Cc: Alex Deucher <[email protected]> Signed-off-by: Srinivasan Shanmugam <[email protected]> Suggested-by: Alex Deucher <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3 |
|
| #
11815bb0 |
| 12-Dec-2024 |
Christian König <[email protected]> |
drm/amdgpu: partially revert "reduce reset time"
This partially reverts commit 194eb174cbe4fe2b3376ac30acca2dc8c8beca00.
This commit introduced a new state variable into adev without even remotely
drm/amdgpu: partially revert "reduce reset time"
This partially reverts commit 194eb174cbe4fe2b3376ac30acca2dc8c8beca00.
This commit introduced a new state variable into adev without even remotely worrying about CPU barriers.
Since we already have the amdgpu_in_reset() function exactly for this use case partially revert that.
Signed-off-by: Christian König <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
34c4eb7d |
| 15-Dec-2024 |
Karol Przybylski <[email protected]> |
drm/amdgpu: Fix potential integer overflow in scheduler mask calculations
The use of 1 << i in scheduler mask calculations can result in an unintentional integer overflow due to the expression being
drm/amdgpu: Fix potential integer overflow in scheduler mask calculations
The use of 1 << i in scheduler mask calculations can result in an unintentional integer overflow due to the expression being evaluated as a 32-bit signed integer.
This patch replaces 1 << i with 1ULL << i to ensure the operation is performed as a 64-bit unsigned integer, preventing overflow
Discovered in coverity scan, CID 1636393, 1636175, 1636007, 1635853
Fixes: c5c63d9cb5d3 ("drm/amdgpu: add amdgpu_gfx_sched_mask and amdgpu_compute_sched_mask debugfs") Signed-off-by: Karol Przybylski <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v6.13-rc2, v6.13-rc1 |
|
| #
55f4139b |
| 27-Nov-2024 |
Srinivasan Shanmugam <[email protected]> |
drm/amd/amdgpu: Add Annotations to Process Isolation functions
This update adds explanations to key functions that manage how the Kernel Fusion Driver (KFD) and Kernel Graphics Driver (KGD) share th
drm/amd/amdgpu: Add Annotations to Process Isolation functions
This update adds explanations to key functions that manage how the Kernel Fusion Driver (KFD) and Kernel Graphics Driver (KGD) share the GPU.
amdgpu_gfx_enforce_isolation_wait_for_kfd: Controls the waiting period for KFD to ensure it takes turns with KGD in using the GPU. It uses a mutex to safely manage shared data, like timing and state, and tracks when KFD starts and stops waiting.
amdgpu_gfx_enforce_isolation_ring_begin_use: Ensures KFD has enough time to run before new tasks are submitted to the GPU ring. It uses a mutex to synchronize access and may adjust the KFD scheduler.
amdgpu_gfx_enforce_isolation_ring_end_use: Handles cleanup and state updates when finishing the use of a GPU ring. It may also adjust the KFD scheduler, using a mutex to manage shared data access.
Cc: Christian König <[email protected]> Cc: Alex Deucher <[email protected]> Signed-off-by: Srinivasan Shanmugam <[email protected]> Suggested-by: Alex Deucher <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
a69f4cc2 |
| 29-Nov-2024 |
Srinivasan Shanmugam <[email protected]> |
drm/amd/amdgpu: Add Descriptions to Process Isolation and Cleaner Shader Sysfs Functions
This update adds explanations to key functions related to process isolation and cleaner shader execution sysf
drm/amd/amdgpu: Add Descriptions to Process Isolation and Cleaner Shader Sysfs Functions
This update adds explanations to key functions related to process isolation and cleaner shader execution sysfs interfaces.
- `amdgpu_gfx_set_run_cleaner_shader`: Describes how to manually run a cleaner shader, which clears the Local Data Store (LDS) and General Purpose Registers (GPRs) to ensure data isolation between GPU workloads.
- `amdgpu_gfx_get_enforce_isolation`: Describes how to query the current settings of the 'enforce_isolation' feature for each GPU partition.
- `amdgpu_gfx_set_enforce_isolation`: Describes how to enable or disable process isolation for GPU partitions through the sysfs interface.
Cc: Christian König <[email protected]> Cc: Alex Deucher <[email protected]> Signed-off-by: Srinivasan Shanmugam <[email protected]> Suggested-by: Alex Deucher <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2 |
|
| #
ff69bba0 |
| 03-Oct-2024 |
Boyuan Zhang <[email protected]> |
drm/amd/pm: add inst to dpm_set_powergating_by_smu
Add an instance parameter to amdgpu_dpm_set_powergating_by_smu() function, and use the instance to call set_powergating_by_smu().
v2: remove dupli
drm/amd/pm: add inst to dpm_set_powergating_by_smu
Add an instance parameter to amdgpu_dpm_set_powergating_by_smu() function, and use the instance to call set_powergating_by_smu().
v2: remove duplicated functions.
remove for-loop in amdgpu_dpm_set_powergating_by_smu(), and temporarily move it to amdgpu_dpm_enable_vcn(), in order to keep the exact same logic as before, until further separation in next patch.
v3: drop SI logic in amdgpu_dpm_enable_vcn().
Signed-off-by: Boyuan Zhang <[email protected]> Acked-by: Christian König <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
2f1b1352 |
| 18-Nov-2024 |
[email protected] <[email protected]> |
drm/amdgpu: Fix sysfs warning when hotplugging
Fix the similar warning when hotplugging:
[ 155.585721] kernfs: can not remove 'enforce_isolation', no directory [ 155.592201] WARNING: CPU: 3 PID:
drm/amdgpu: Fix sysfs warning when hotplugging
Fix the similar warning when hotplugging:
[ 155.585721] kernfs: can not remove 'enforce_isolation', no directory [ 155.592201] WARNING: CPU: 3 PID: 6960 at fs/kernfs/dir.c:1683 kernfs_remove_by_name_ns+0xb9/0xc0 [ 155.601145] Modules linked in: xt_MASQUERADE xt_comment nft_compat veth bridge stp llc overlay nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink qrtr intel_rapl_msr amd_atl intel_rapl_common amd64_edac edac_mce_amd amdgpu kvm_amd kvm ipmi_ssif amdxcp rapl drm_exec gpu_sched drm_buddy i2c_algo_bit drm_suballoc_helper drm_ttm_helper ttm pcspkr drm_display_helper acpi_cpufreq drm_kms_helper video wmi k10temp i2c_piix4 acpi_ipmi ipmi_si drm zram ip_tables loop squashfs dm_multipath crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 sp5100_tco ixgbe rfkill ccp dca sunrpc be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls cxgb3i cxgb3 mdio libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ipmi_devintf ipmi_msghandler fuse [ 155.685224] systemd-journald[1354]: Compressed data object 957 -> 524 using ZSTD [ 155.685687] CPU: 3 PID: 6960 Comm: amd_pci_unplug Not tainted 6.10.0-1148853.1.zuul.164395107d6642bdb451071313e9378d #1 [ 155.704149] Hardware name: TYAN B8021G88V2HR-2T/S8021GM2NR-2T, BIOS V1.03.B10 04/01/2019 [ 155.712383] RIP: 0010:kernfs_remove_by_name_ns+0xb9/0xc0 [ 155.717805] Code: a0 00 48 89 ef e8 37 96 c7 ff 5b b8 fe ff ff ff 5d 41 5c 41 5d e9 f7 96 a0 00 0f 0b eb ab 48 c7 c7 48 ba 7e 8f e8 f7 66 bf ff <0f> 0b eb dc 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 [ 155.736766] RSP: 0018:ffffb1685d7a3e20 EFLAGS: 00010296 [ 155.742108] RAX: 0000000000000038 RBX: ffff929e94c80000 RCX: 0000000000000000 [ 155.749363] RDX: ffff928e1efaf200 RSI: ffff928e1efa18c0 RDI: ffff928e1efa18c0 [ 155.756612] RBP: 0000000000000008 R08: 0000000000000000 R09: 0000000000000003 [ 155.763855] R10: ffffb1685d7a3cd8 R11: ffffffff8fb3e1c8 R12: ffffffffc1ef5341 [ 155.771104] R13: ffff929e94cc5530 R14: 0000000000000000 R15: 0000000000000000 [ 155.778357] FS: 00007fd9dd8d9c40(0000) GS:ffff928e1ef80000(0000) knlGS:0000000000000000 [ 155.786594] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 155.792450] CR2: 0000561245ceee38 CR3: 0000000113018000 CR4: 00000000003506f0 [ 155.799702] Call Trace: [ 155.802254] <TASK> [ 155.804460] ? __warn+0x80/0x120 [ 155.807798] ? kernfs_remove_by_name_ns+0xb9/0xc0 [ 155.812617] ? report_bug+0x164/0x190 [ 155.816393] ? handle_bug+0x3c/0x80 [ 155.819994] ? exc_invalid_op+0x17/0x70 [ 155.823939] ? asm_exc_invalid_op+0x1a/0x20 [ 155.828235] ? kernfs_remove_by_name_ns+0xb9/0xc0 [ 155.833058] amdgpu_gfx_sysfs_fini+0x59/0xd0 [amdgpu] [ 155.838637] gfx_v9_0_sw_fini+0x123/0x1c0 [amdgpu] [ 155.843887] amdgpu_device_fini_sw+0xbc/0x3e0 [amdgpu] [ 155.849432] amdgpu_driver_release_kms+0x16/0x30 [amdgpu] [ 155.855235] drm_dev_put.part.0+0x3c/0x60 [drm] [ 155.859914] drm_release+0x8b/0xc0 [drm] [ 155.863978] __fput+0xf1/0x2c0 [ 155.867141] __x64_sys_close+0x3c/0x80 [ 155.870998] do_syscall_64+0x64/0x170
V2: Add details in comments (Tim)
Signed-off-by: Jesse Zhang <[email protected]> Reported-by: Andy Dong <[email protected]> Reviewed-by: Tim Huang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
8521e3c5 |
| 23-Oct-2024 |
Shaoyun Liu <[email protected]> |
drm/amd/amdgpu: limit single process inside MES
This is for MES to limit only one process for the user queues
Signed-off-by: Shaoyun Liu <[email protected]> Reviewed-by: Alex Deucher <alexander.d
drm/amd/amdgpu: limit single process inside MES
This is for MES to limit only one process for the user queues
Signed-off-by: Shaoyun Liu <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
84a2947e |
| 30-Oct-2024 |
Victor Skvortsov <[email protected]> |
drm/amdgpu: Implement virt req_ras_err_count
Enable RAS late init if VF RAS Telemetry is supported.
When enabled, the VF can use this interface to query total RAS error counts from the host.
The
drm/amdgpu: Implement virt req_ras_err_count
Enable RAS late init if VF RAS Telemetry is supported.
When enabled, the VF can use this interface to query total RAS error counts from the host.
The VF FB access may abruptly end due to a fatal error, therefore the VF must cache and sanitize the input.
The Host allows 15 Telemetry messages every 60 seconds, afterwhich the host will ignore any more in-coming telemetry messages. The VF will rate limit its msg calling to once every 5 seconds (12 times in 60 seconds). While the VF is rate limited, it will continue to report the last good cached data.
v2: Flip generate report & update statistics order for VF
Signed-off-by: Victor Skvortsov <[email protected]> Acked-by: Tao Zhou <[email protected]> Reviewed-by: Zhigang Luo <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
7b1ebbe8 |
| 05-Nov-2024 |
Lijo Lazar <[email protected]> |
drm/amdgpu: Avoid kcq disable during reset
Reset sequence indicates that hardware already ran into a bad state. Avoid sending unmap queue request to reset KCQ. This will also cover RAS error scenari
drm/amdgpu: Avoid kcq disable during reset
Reset sequence indicates that hardware already ran into a bad state. Avoid sending unmap queue request to reset KCQ. This will also cover RAS error scenarios which need a reset to recover, hence remove the check.
Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Le Ma <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
fa317985 |
| 05-Nov-2024 |
Lijo Lazar <[email protected]> |
drm/amdgpu: Fix map/unmap queue logic
In current logic, it calls ring_alloc followed by a ring_test. ring_test in turn will call another ring_alloc. This is illegal usage as a ring_alloc is expected
drm/amdgpu: Fix map/unmap queue logic
In current logic, it calls ring_alloc followed by a ring_test. ring_test in turn will call another ring_alloc. This is illegal usage as a ring_alloc is expected to be closed properly with a ring_commit. Change to commit the map/unmap queue packet first followed by a ring_test. Add a comment about the usage of ring_test.
Also, reorder the current pre-condition checks of job hang or kiq ring scheduler not ready. Without them being met, it is not useful to attempt ring or memory allocations.
Fixes tag refers to the original patch which introduced this issue which then got carried over into newer code.
Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Le Ma <[email protected]> Signed-off-by: Alex Deucher <[email protected]> Fixes: 6c10b5cc4eaa ("drm/amdgpu: Remove duplicate code in gfx_v8_0.c")
show more ...
|
| #
6c8d1f4b |
| 05-Nov-2024 |
[email protected] <[email protected]> |
drm/amdgpu: Add sysfs interface for gc reset mask
Add two sysfs interfaces for gfx and compute: gfx_reset_mask compute_reset_mask
These interfaces are read-only and show the resets supported by the
drm/amdgpu: Add sysfs interface for gc reset mask
Add two sysfs interfaces for gfx and compute: gfx_reset_mask compute_reset_mask
These interfaces are read-only and show the resets supported by the IP. For example, full adapter reset (mode1/mode2/BACO/etc), soft reset, queue reset, and pipe reset.
V2: the sysfs node returns a text string instead of some flags (Christian) v3: add a generic helper which takes the ring as parameter and print the strings in the order they are applied (Christian)
check amdgpu_gpu_recovery before creating sysfs file itself, and initialize supported_reset_types in IP version files (Lijo) v4: Fixing uninitialized variables (Tim)
Signed-off-by: Jesse Zhang <[email protected]> Suggested-by: Alex Deucher <[email protected]> Reviewed-by: Tim Huang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
047767dd |
| 29-Oct-2024 |
Lijo Lazar <[email protected]> |
drm/amdgpu: Group gfx sysfs functions
Make amdgpu_gfx_sysfs_init/fini functions as common entry points for all gfx related sysfs nodes.
Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: A
drm/amdgpu: Group gfx sysfs functions
Make amdgpu_gfx_sysfs_init/fini functions as common entry points for all gfx related sysfs nodes.
Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|