|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12 |
|
| #
a86e0c0e |
| 15-Nov-2024 |
Lijo Lazar <[email protected]> |
drm/amdgpu: Add init level for post reset reinit
When a device needs to be reset before initialization, not all IPs are required to be initialized before the reset. In such cases, the driver needs to identify whether an IP/feature is being initialized for the first time or reinitialized after a reset.
Add a RESET_RECOVERY init level to identify the post-reset reinitialization phase. This provides only device-level identification; IPs/features may also choose to track their state independently.
Signed-off-by: Lijo Lazar <[email protected]> Acked-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
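The distinction described in this commit can be illustrated with a small userspace sketch. It is not the driver's code: only the RESET_RECOVERY name comes from the commit text, and every `demo_` identifier is a hypothetical stand-in. It shows a device-level init level combined with per-IP state tracking, so an IP that was skipped before the reset is still treated as a first-time init.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device-level init phases; only RESET_RECOVERY is named in the commit. */
enum demo_init_level {
    DEMO_INIT_LEVEL_DEFAULT,        /* regular first-time initialization */
    DEMO_INIT_LEVEL_RESET_RECOVERY, /* reinitialization phase after a reset */
};

struct demo_device {
    enum demo_init_level init_level;
};

/* Hypothetical IP block that also tracks its own state independently. */
struct demo_ip_block {
    bool initialized_once;
    int first_init_count;
    int reinit_count;
};

/* Decide between first-time init and reinit using both device and IP state. */
static void demo_ip_init(const struct demo_device *dev, struct demo_ip_block *ip)
{
    assert(dev && ip);
    if (dev->init_level == DEMO_INIT_LEVEL_RESET_RECOVERY && ip->initialized_once) {
        ip->reinit_count++;          /* already set up before the reset */
    } else {
        ip->first_init_count++;      /* never initialized, even if we are post reset */
        ip->initialized_once = true;
    }
}
```

The device-level value alone cannot say whether a given IP ran before the reset, which is why the sketch keeps a per-IP flag as well.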
|
|
Revision tags: v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5 |
|
| #
1e4acf4d |
| 21-Aug-2024 |
Lijo Lazar <[email protected]> |
drm/amdgpu: Add reset on init handler for XGMI
In some cases, a device needs to be reset before first use. Add handlers for performing a device reset during the driver init sequence.
Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Feifei Xu <[email protected]> Acked-by: Rajneesh Bhardwaj <[email protected]> Tested-by: Rajneesh Bhardwaj <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.11-rc4, v6.11-rc3, v6.11-rc2 |
|
| #
19cff165 |
| 02-Aug-2024 |
Victor Skvortsov <[email protected]> |
drm/amdgpu: abort KIQ waits when there is a pending reset
Stop waiting for the KIQ to return when there is a reset pending. It's quite likely that the KIQ will never respond.
Signed-off-by: Koenig Christian <[email protected]> Suggested-by: Lazar Lijo <[email protected]> Tested-by: Victor Skvortsov <[email protected]> Signed-off-by: Victor Skvortsov <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
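As a rough illustration of the idea in this commit, the following userspace sketch polls a simulated fence and gives up early once a reset is pending. All names (`demo_ring`, `demo_wait_fence`, the result enum) are hypothetical, and the sequence comparison ignores counter wraparound for brevity; the real driver's wait path is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a queue whose completion is signalled by a sequence number. */
struct demo_ring {
    unsigned int fence_seq; /* last sequence number signalled by the simulated queue */
    bool reset_pending;     /* set once a reset has been scheduled */
};

enum demo_wait_result { DEMO_WAIT_DONE, DEMO_WAIT_TIMEOUT, DEMO_WAIT_ABORTED };

/* Poll up to max_polls times, but stop immediately if a reset is pending. */
static enum demo_wait_result
demo_wait_fence(const struct demo_ring *ring, unsigned int seq, int max_polls)
{
    assert(ring);
    for (int i = 0; i < max_polls; i++) {
        if (ring->fence_seq >= seq)
            return DEMO_WAIT_DONE;    /* work completed, nothing to abort */
        if (ring->reset_pending)
            return DEMO_WAIT_ABORTED; /* the queue will likely never answer */
    }
    return DEMO_WAIT_TIMEOUT;
}
```

Checking completion before the reset flag means a fence that has already signalled is still reported as done.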
|
|
Revision tags: v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3 |
|
| #
2656e1ce |
| 03-Jun-2024 |
Eric Huang <[email protected]> |
drm/amdgpu: add reset sources in gpu reset context
The reset source (or reset cause) is very useful information for the reset context; it will be used by the events API.
Suggested-by: Lijo Lazar <[email protected]> Signed-off-by: Eric Huang <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6 |
|
| #
25c01191 |
| 22-Apr-2024 |
Yunxiang Li <[email protected]> |
drm/amdgpu: Add reset_context flag for host FLR
There are other reset sources that pass NULL as the job pointer, such as amdgpu_amdkfd_reset_work. Therefore, using the job pointer to check if the FLR comes from the host does not work.
Add a flag in reset_context to explicitly mark host triggered reset, and set this flag when we receive host reset notification.
Signed-off-by: Yunxiang Li <[email protected]> Reviewed-by: Emily Deng <[email protected]> Reviewed-by: Zhigang Luo <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
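The reasoning in this commit (a NULL job pointer is ambiguous, so an explicit flag is needed) can be sketched as follows. This is a minimal userspace illustration with hypothetical `demo_` names and a hypothetical bit position, not the driver's actual structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits; the real driver's names and values may differ. */
enum demo_reset_flag {
    DEMO_RESET_FLAG_HOST_FLR = 0,
};

struct demo_job { int id; };

struct demo_reset_context {
    struct demo_job *job; /* NULL for several unrelated reset sources */
    unsigned long flags;  /* one bit per enum demo_reset_flag value */
};

/* Mark the context explicitly instead of relying on job == NULL. */
static void demo_set_flag(struct demo_reset_context *ctx, enum demo_reset_flag f)
{
    assert(ctx);
    ctx->flags |= 1UL << f;
}

static bool demo_is_host_flr(const struct demo_reset_context *ctx)
{
    assert(ctx);
    return (ctx->flags & (1UL << DEMO_RESET_FLAG_HOST_FLR)) != 0;
}
```

Two contexts with the same NULL job pointer are then distinguishable only by the flag, which is the point the commit message makes.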
|
|
Revision tags: v6.9-rc5 |
|
| #
ea137071 |
| 16-Apr-2024 |
Ahmad Rehman <[email protected]> |
drm/amdgpu: Skip the coredump collection on reset during driver reload
In a passthrough environment, the driver triggers a mode-1 reset on reload. The reset causes core dump collection, which is a delayed task and prevents the driver from unloading until it is completed. Since we do not need to collect data in the "reset on reload" case, we can skip core dump collection.
v2: Use the same flag to avoid calling amdgpu_reset_reg_dumps as well.
Signed-off-by: Ahmad Rehman <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1 |
|
| #
9022f01b |
| 20-Mar-2024 |
Sunil Khatri <[email protected]> |
drm/amdgpu: refactor code to split devcoredump code
Refactor the devcoredump code into new files, since its functionality is being expanded further and it is better to split it out so that devcoredump has its own file.
v2: Fix the build failure caught by arm compiler of implicit function declaration with #ifdef
v3: squash in fix for implicit declaration error
Cc: Ivan Lipski <[email protected]> Acked-by: Christian König <[email protected]> Signed-off-by: Sunil Khatri <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.8, v6.8-rc7 |
|
| #
5e592956 |
| 01-Mar-2024 |
Sunil Khatri <[email protected]> |
drm/amdgpu: add ring timeout information in devcoredump
Add ring timeout related information in the amdgpu devcoredump file for debugging purposes.
During the GPU recovery process the registered call is triggered and adds the debug information to the data file created by the devcoredump framework under the directory /sys/class/devcoredump/devcdx/
Signed-off-by: Sunil Khatri <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1 |
|
| #
fb1c93c2 |
| 10-Jan-2024 |
Christian König <[email protected]> |
drm/amdgpu: revert "Adjust removal control flow for smu v13_0_2"
Calling amdgpu_device_ip_resume_phase1() during shutdown leaves the HW in an active state and is an unbalanced use of the IP callbacks.
Using the IP callbacks like this can lead to memory leaks, double free and imbalanced reference counters.
Leaving the HW in an active state can lead to DMA accesses to memory now freed by the driver.
Both are a complete no-go for driver unload, so completely revert the workaround for now.
This reverts commit f5c7e7797060255dbc8160734ccc5ad6183c5e04.
Signed-off-by: Christian König <[email protected]> Acked-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]> Cc: [email protected]
|
| #
087a3e13 |
| 10-Jan-2024 |
Christian König <[email protected]> |
drm/amdgpu: revert "Adjust removal control flow for smu v13_0_2"
Calling amdgpu_device_ip_resume_phase1() during shutdown leaves the HW in an active state and is an unbalanced use of the IP callbacks.
Using the IP callbacks like this can lead to memory leaks, double free and imbalanced reference counters.
Leaving the HW in an active state can lead to DMA accesses to memory now freed by the driver.
Both are a complete no-go for driver unload, so completely revert the workaround for now.
This reverts commit f5c7e7797060255dbc8160734ccc5ad6183c5e04.
Signed-off-by: Christian König <[email protected]> Acked-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7, v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2 |
|
| #
de009982 |
| 15-Sep-2023 |
André Almeida <[email protected]> |
drm/amdgpu: Create version number for coredumps
Even if there's nothing currently parsing amdgpu's coredump files, if we eventually have such tools they will be glad to find a version field to properly read the file.
Create a version number to be displayed at the top of the coredump file, to be incremented when the file format or content changes.
Signed-off-by: André Almeida <[email protected]> Reviewed-by: Shashank Sharma <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
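To show why a version field helps future parsers, here is a minimal userspace sketch that writes a version line at the top of a dump and reads it back. The header text, the version value, and the `demo_` function names are assumptions made for illustration; they are not the layout the driver actually emits.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical version string; bump it whenever the file format or content changes. */
#define DEMO_COREDUMP_VERSION "1"

/* Write an illustrative header with the version on top; returns snprintf's result. */
static int demo_write_header(char *buf, size_t len, const char *kernel_release)
{
    assert(buf && kernel_release);
    return snprintf(buf, len,
                    "**** DEMO Device Coredump ****\n"
                    "version: %s\n"
                    "kernel: %s\n",
                    DEMO_COREDUMP_VERSION, kernel_release);
}

/* A parser can check the version before interpreting the rest; -1 means not found. */
static int demo_parse_version(const char *dump)
{
    int version;
    const char *p;

    assert(dump);
    p = strstr(dump, "version: ");
    if (!p || sscanf(p, "version: %d", &version) != 1)
        return -1;
    return version;
}
```

A tool reading the file can reject or adapt to versions it does not understand instead of guessing the layout.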
|
| #
69619868 |
| 15-Sep-2023 |
André Almeida <[email protected]> |
drm/amdgpu: Move coredump code to amdgpu_reset file
Given that we use coredump just for device resets, move its functions and structs to a more semantic file, amdgpu_reset.{c, h}.
Signed-off-by: André Almeida <[email protected]> Signed-off-by: Shashank Sharma <[email protected]> Reviewed-by: Shashank Sharma <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5 |
|
| #
f8a499ae |
| 05-Aug-2023 |
Lijo Lazar <[email protected]> |
drm/amdgpu: Keep reset handlers shared
Instead of maintaining a list per device, keep the reset handlers common per ASIC family. A pointer to the list of handlers is maintained in reset control.
Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Le Ma <[email protected]> Reviewed-by: Asad Kamal <[email protected]> Tested-by: Asad Kamal <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
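The arrangement this commit describes, one handler table per ASIC family with each reset control holding only a pointer to it, might look roughly like the sketch below. The struct layout, the NULL-terminated array, and all `demo_` names are assumptions for illustration rather than the driver's definitions.

```c
#include <assert.h>
#include <stddef.h>

enum demo_reset_method { DEMO_RESET_MODE1, DEMO_RESET_MODE2 };

struct demo_reset_handler {
    enum demo_reset_method method;
    int (*do_reset)(void *dev);
};

/* Placeholder reset routine; a real handler would touch hardware. */
static int demo_mode2_reset(void *dev)
{
    (void)dev;
    return 0;
}

static struct demo_reset_handler demo_mode2_handler = {
    DEMO_RESET_MODE2, demo_mode2_reset
};

/* One NULL-terminated table shared by every device of the (hypothetical) family. */
static struct demo_reset_handler *const demo_family_handlers[] = {
    &demo_mode2_handler,
    NULL,
};

struct demo_reset_control {
    struct demo_reset_handler *const *handlers; /* points at the shared table */
};

/* Look up the handler for a method; returns NULL if the family has none. */
static struct demo_reset_handler *
demo_find_handler(const struct demo_reset_control *ctl, enum demo_reset_method m)
{
    assert(ctl && ctl->handlers);
    for (size_t i = 0; ctl->handlers[i]; i++)
        if (ctl->handlers[i]->method == m)
            return ctl->handlers[i];
    return NULL;
}
```

Because the table is shared and read-only from the control's point of view, no per-device list nodes need to be allocated or linked.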
|
|
Revision tags: v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7, v6.4-rc6, v6.4-rc5, v6.4-rc4, v6.4-rc3, v6.4-rc2, v6.4-rc1, v6.3, v6.3-rc7, v6.3-rc6, v6.3-rc5, v6.3-rc4, v6.3-rc3, v6.3-rc2, v6.3-rc1, v6.2, v6.2-rc8, v6.2-rc7, v6.2-rc6, v6.2-rc5, v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1, v6.1, v6.1-rc8, v6.1-rc7, v6.1-rc6, v6.1-rc5, v6.1-rc4, v6.1-rc3, v6.1-rc2, v6.1-rc1 |
|
| #
a340847b |
| 13-Oct-2022 |
Victor Zhao <[email protected]> |
Revert "drm/amdgpu: let mode2 reset fallback to default when failure"
This reverts commit dac6b80818ac2353631c5a33d140d8d5508e2957.
This commit reverts AMDGPU_SKIP_MODE2_RESET, as it conflicts with the original design of the reset handler. It will be redesigned.
Fixes: dac6b80818ac23 ("drm/amdgpu: let mode2 reset fallback to default when failure") Signed-off-by: Victor Zhao <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
| #
b98a1648 |
| 13-Oct-2022 |
Victor Zhao <[email protected]> |
Revert "drm/amdgpu: let mode2 reset fallback to default when failure"
This reverts commit dac6b80818ac2353631c5a33d140d8d5508e2957.
This commit reverts AMDGPU_SKIP_MODE2_RESET, as it conflicts with the original design of the reset handler. It will be redesigned.
Fixes: dac6b80818ac23 ("drm/amdgpu: let mode2 reset fallback to default when failure") Signed-off-by: Victor Zhao <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.0 |
|
| #
f61a825a |
| 28-Sep-2022 |
Vignesh Chander <[email protected]> |
drm/amdgpu: Skip put_reset_domain if it doesn't exist
For XGMI SR-IOV, the reset is handled by the host driver and hive->reset_domain is not initialized, so we need to check whether it exists before doing a put.
Signed-off-by: Vignesh Chander <[email protected]> Reviewed-by: Shaoyun Liu <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.0-rc7, v6.0-rc6, v6.0-rc5 |
|
| #
f5c7e779 |
| 07-Sep-2022 |
YiPeng Chai <[email protected]> |
drm/amdgpu: Adjust removal control flow for smu v13_0_2
Adjust removal control flow for smu v13_0_2: During amdgpu uninstallation, when removing the first device, the kernel needs to first send a mode1reset message to all gpu devices. Otherwise, smu initialization will fail the next time amdgpu is installed.
V2:
1. Update commit comments.
2. Remove the global variable amdgpu_device_remove_cnt and add a variable to the structure amdgpu_hive_info.
3. Use hive to detect the first removed device instead of a global variable.
V3:
1. Update commit comments.
2. Split a patch into multiple patches.
3. The current patch does:
a. Add a work mode of AMDGPU_RESET_FOR_DEVICE_REMOVE into the existing gpu recover path, which makes all devices in the hive list only have HW reset but no resume (except the base IP).
b. Call AMDGPU_RESET_FOR_DEVICE_REMOVE and AMDGPU_NEED_FULL_RESET mode of amdgpu_device_gpu_recover in amdgpu_pci_remove when removing the first device in the hive list.
c. When removing the first device, the IP blocks keyword function call sequence is as follows: .suspend->mode1reset->.resume(basic ip)->.hw_fini->.early_fini->.sw_fini. The first three sequences are because of a call to amdgpu_device_gpu_recover. The three sequences will be executed in a loop until all devices in the hive list are iterated. The sequences starting from .hw_fini only apply to the first device. Since .suspend has been called before, except the resumed phase1 basic ip blocks, all other ip blocks .hw_fini of the current device will do nothing.
d. When removing other devices, the calling sequence is the same as legacy: .hw_fini -> .early_fini -> .sw_fini. Since .suspend has been called when removing the first device, except the resumed phase1 basic ip blocks, all other ip blocks .hw_fini of the current device will do nothing.
Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.0-rc4, v6.0-rc3, v6.0-rc2, v6.0-rc1, v5.19 |
|
| #
dac6b808 |
| 28-Jul-2022 |
Victor Zhao <[email protected]> |
drm/amdgpu: let mode2 reset fallback to default when failure
- introduce AMDGPU_SKIP_MODE2_RESET flag
- let mode2 reset fallback to default reset method if failed
v2: move this part out from the asic specific part
Signed-off-by: Victor Zhao <[email protected]> Acked-by: Andrey Grodzovsky <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
| #
0a83bb35 |
| 03-Aug-2022 |
Lijo Lazar <[email protected]> |
drm/amdgpu: Avoid another list of reset devices
A list of devices to be reset is already created in amdgpu_device_gpu_recover function. Creating another list with the same nodes is incorrect and not supported in list_head. Instead, pass the device list as part of reset context.
Fixes: 9e08564727fc (drm/amdgpu: Refactor mode2 reset logic for v13.0.2) Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v5.19-rc8, v5.19-rc7, v5.19-rc6, v5.19-rc5, v5.19-rc4, v5.19-rc3, v5.19-rc2, v5.19-rc1, v5.18 |
|
| #
ab9a0b1f |
| 17-May-2022 |
Andrey Grodzovsky <[email protected]> |
drm/amdgpu: Cache result of last reset at reset domain level.
Will be read by executors of async reset like debugfs.
Signed-off-by: Andrey Grodzovsky <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v5.18-rc7, v5.18-rc6, v5.18-rc5, v5.18-rc4, v5.18-rc3, v5.18-rc2, v5.18-rc1, v5.17, v5.17-rc8, v5.17-rc7, v5.17-rc6, v5.17-rc5, v5.17-rc4 |
|
| #
f5666d48 |
| 10-Feb-2022 |
Andrey Grodzovsky <[email protected]> |
drm/amdgpu: Fix compile error.
Seems I forgot to add this to the relevant commit when submitting.
Signed-off-by: Andrey Grodzovsky <[email protected]> Reported-by: kernel test robot <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Christian König <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
Revision tags: v5.17-rc3, v5.17-rc2 |
|
| #
e923be99 |
| 25-Jan-2022 |
Andrey Grodzovsky <[email protected]> |
drm/amdgpu: Rework amdgpu_device_lock_adev
This function needs to be split into two parts: one is called only once to lock the single instance of the reset_domain's sem and reset flag, and the other, which handles MP1 states, should still be called for each device in the XGMI hive.
Signed-off-by: Andrey Grodzovsky <[email protected]> Reviewed-by: Christian König <[email protected]> Link: https://www.spinics.net/lists/amd-gfx/msg74118.html
|
|
Revision tags: v5.17-rc1 |
|
| #
89a7a870 |
| 19-Jan-2022 |
Andrey Grodzovsky <[email protected]> |
drm/amdgpu: Move in_gpu_reset into reset_domain
We should have a single instance per entire reset domain.
Signed-off-by: Andrey Grodzovsky <[email protected]> Suggested-by: Lijo Lazar <[email protected]> Reviewed-by: Christian König <[email protected]> Link: https://www.spinics.net/lists/amd-gfx/msg74116.html
|
| #
d0fb18b5 |
| 19-Jan-2022 |
Andrey Grodzovsky <[email protected]> |
drm/amdgpu: Move reset sem into reset_domain
We want a single instance of the reset sem across all reset clients because, in the case of XGMI, we should stop cross-device MMIO access, since any of the devices could be in a reset at that moment.
Signed-off-by: Andrey Grodzovsky <[email protected]> Reviewed-by: Christian König <[email protected]> Link: https://www.spinics.net/lists/amd-gfx/msg74117.html
|
| #
cfbb6b00 |
| 21-Jan-2022 |
Andrey Grodzovsky <[email protected]> |
drm/amdgpu: Rework reset domain to be refcounted.
The reset domain now contains the register access semaphore, so it needs to be present for as long as each device in a hive needs it and cannot be bound to the XGMI hive life cycle. Address this by making the reset domain refcounted and pointed to by each member of the hive and by the hive itself.
v4:
Fix crash on boot with XGMI hive by adding type to reset_domain. XGMI will only create a new reset_domain if the previous one was of single device type, meaning it's the first boot. Otherwise it will take a refcount to the existing reset_domain from the amdgpu device.
Add a wrapper around reset_domain->refcount get/put and a wrapper around send to reset wq (Lijo)
Signed-off-by: Andrey Grodzovsky <[email protected]> Acked-by: Christian König <[email protected]> Link: https://www.spinics.net/lists/amd-gfx/msg74121.html
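The lifetime rule described in this commit (the domain must outlive the hive for as long as any device still holds it) can be sketched with a plain reference count. This is a single-threaded userspace illustration with hypothetical `demo_` names; a real driver would need an atomic refcount, and tolerating NULL in the put helper is a choice made for this sketch only.

```c
#include <assert.h>
#include <stdlib.h>

enum demo_domain_type { DEMO_DOMAIN_SINGLE_DEVICE, DEMO_DOMAIN_XGMI_HIVE };

struct demo_reset_domain {
    int refcount; /* not atomic: illustration only */
    enum demo_domain_type type;
};

/* Counts live domains so the sketch can show when the object is really freed. */
static int demo_live_domains;

/* Create a domain owned by the caller (refcount starts at 1). */
static struct demo_reset_domain *demo_domain_create(enum demo_domain_type type)
{
    struct demo_reset_domain *d = malloc(sizeof(*d));

    if (!d)
        return NULL;
    d->refcount = 1;
    d->type = type;
    demo_live_domains++;
    return d;
}

/* Wrapper: take an extra reference, e.g. when a device joins the hive. */
static void demo_domain_get(struct demo_reset_domain *d)
{
    assert(d && d->refcount > 0);
    d->refcount++;
}

/* Wrapper: drop a reference and free the domain when the last holder is gone. */
static void demo_domain_put(struct demo_reset_domain *d)
{
    if (!d)
        return; /* nothing to release */
    assert(d->refcount > 0);
    if (--d->refcount == 0) {
        free(d);
        demo_live_domains--;
    }
}
```

In this sketch the hive can drop its reference first and the domain stays valid until the last device releases it.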
|