|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5 |
|
| #
a91d91b6 |
| 26-Feb-2025 |
Tony Yi <[email protected]> |
drm/amdgpu: Add support for CPERs on virtualization
Add support for CPERs on VFs.
VFs do not receive PMFW messages directly; as such, they need to query them from the host. To avoid hitting host ev
drm/amdgpu: Add support for CPERs on virtualization
Add support for CPERs on VFs.
VFs do not receive PMFW messages directly; as such, they need to query them from the host. To avoid hitting host event guard, CPER queries need to be rate limited. CPER queries share the same RAS telemetry buffer as error count query, so a mutex protecting the shared buffer was added as well.
For readability, the amdgpu_detect_virtualization was refactored into multiple individual functions.
Signed-off-by: Tony Yi <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6 |
|
| #
9928509d |
| 30-Oct-2024 |
Victor Skvortsov <[email protected]> |
drm/amdgpu: Add msg handlers for SRIOV RAS Telemetry
Add message handlers for RAS telemetry.
Signed-off-by: Victor Skvortsov <[email protected]> Reviewed-by: Zhigang Luo <[email protected]
drm/amdgpu: Add msg handlers for SRIOV RAS Telemetry
Add message handlers for RAS telemetry.
Signed-off-by: Victor Skvortsov <[email protected]> Reviewed-by: Zhigang Luo <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6 |
|
| #
cbda2758 |
| 24-Jun-2024 |
Vignesh Chander <[email protected]> |
drm/amdgpu: process RAS fatal error MB notification
For RAS error scenario, VF guest driver will check mailbox and set fed flag to avoid unnecessary HW accesses. additionally, poll for reset complet
drm/amdgpu: process RAS fatal error MB notification
For RAS error scenario, VF guest driver will check mailbox and set fed flag to avoid unnecessary HW accesses. additionally, poll for reset completion message first to avoid accidentally spamming multiple reset requests to host.
v2: add another mailbox check for handling case where kfd detects timeout first
v3: set host_flr bit and use wait_for_reset
Signed-off-by: Vignesh Chander <[email protected]> Reviewed-by: Zhigang Luo <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v6.10-rc5 |
|
| #
29b6985d |
| 16-Jun-2024 |
Vignesh Chander <[email protected]> |
drm/amdgpu: Use dev_ prints for virtualization as it supports multi adapter
So we can get clearer per device logging.
Signed-off-by: Vignesh Chander <[email protected]> Reviewed-by: Zhigang L
drm/amdgpu: Use dev_ prints for virtualization as it supports multi adapter
So we can get clearer per device logging.
Signed-off-by: Vignesh Chander <[email protected]> Reviewed-by: Zhigang Luo <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1 |
|
| #
5c0a1cdd |
| 24-May-2024 |
Yunxiang Li <[email protected]> |
drm/amdgpu: fix sriov host flr handler
We send back the ready to reset message before we stop anything. This is wrong. Move it to when we are actually ready for the FLR to happen.
In the current st
drm/amdgpu: fix sriov host flr handler
We send back the ready to reset message before we stop anything. This is wrong. Move it to when we are actually ready for the FLR to happen.
In the current state since we take tens of seconds to stop everything, it is very likely that host would give up waiting and reset the GPU before we send ready, so it would be the same as before. But this gets rid of the hack with reset_domain locking and also let us tell how slow ready to reset actually is from the host. The ready to reset speed can be improved later.
Signed-off-by: Yunxiang Li <[email protected]> Acked-by: Christian König <[email protected]> Reviewed-by: Emily Deng <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v6.9, v6.9-rc7, v6.9-rc6 |
|
| #
25c01191 |
| 22-Apr-2024 |
Yunxiang Li <[email protected]> |
drm/amdgpu: Add reset_context flag for host FLR
There are other reset sources that pass NULL as the job pointer, such as amdgpu_amdkfd_reset_work. Therefore, using the job pointer to check if the FL
drm/amdgpu: Add reset_context flag for host FLR
There are other reset sources that pass NULL as the job pointer, such as amdgpu_amdkfd_reset_work. Therefore, using the job pointer to check if the FLR comes from the host does not work.
Add a flag in reset_context to explicitly mark host triggered reset, and set this flag when we receive host reset notification.
Signed-off-by: Yunxiang Li <[email protected]> Reviewed-by: Emily Deng <[email protected]> Reviewed-by: Zhigang Luo <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
f4322b9f |
| 22-Apr-2024 |
Yunxiang Li <[email protected]> |
drm/amdgpu: Fix two reset triggered in a row
Some times a hang GPU causes multiple reset sources to schedule resets. The second source will be able to trigger an unnecessary reset if they schedule a
drm/amdgpu: Fix two reset triggered in a row
Some times a hang GPU causes multiple reset sources to schedule resets. The second source will be able to trigger an unnecessary reset if they schedule after we call amdgpu_device_stop_pending_resets.
Move amdgpu_device_stop_pending_resets to after the reset is done. Since at this point the GPU is supposedly in a good state, any reset scheduled after this point would be a legitimate reset.
Remove unnecessary and incorrect checks for amdgpu_in_reset that was kinda serving this purpose.
Signed-off-by: Yunxiang Li <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v6.9-rc5, v6.9-rc4 |
|
| #
f5007c67 |
| 11-Apr-2024 |
Zhigang Luo <[email protected]> |
drm/amdgpu: update vf to pf message retry from 2 to 5
increase retry times to wait host has enough time to complete reset.
Signed-off-by: Zhigang Luo <[email protected]> Reviewed-by: Hawking Zhan
drm/amdgpu: update vf to pf message retry from 2 to 5
increase retry times to wait host has enough time to complete reset.
Signed-off-by: Zhigang Luo <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
6a009ca1 |
| 17-Apr-2024 |
Zhigang Luo <[email protected]> |
drm/amdgpu: remove virt_init_data_exchange from poison consumption handler
Host will initiate an FLR for all poison consumption. Guest should wait for FLR message to re-init data exchange.
Signed-o
drm/amdgpu: remove virt_init_data_exchange from poison consumption handler
Host will initiate an FLR for all poison consumption. Guest should wait for FLR message to re-init data exchange.
Signed-off-by: Zhigang Luo <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7 |
|
| #
ab66c832 |
| 29-Feb-2024 |
Zhigang Luo <[email protected]> |
drm/amdgpu: trigger flr_work if reading pf2vf data failed
if reading pf2vf data failed 30 times continuously, it means something is wrong. Need to trigger flr_work to recover the issue.
also use de
drm/amdgpu: trigger flr_work if reading pf2vf data failed
if reading pf2vf data failed 30 times continuously, it means something is wrong. Need to trigger flr_work to recover the issue.
also use dev_err to print the error message to get which device has issue and add warning message if waiting IDH_FLR_NOTIFICATION_CMPL timeout.
Signed-off-by: Zhigang Luo <[email protected]> Acked-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
6bb89d13 |
| 14-Mar-2024 |
Victor Skvortsov <[email protected]> |
drm/amdgpu: Skip virt_exchange_init on SDMA poison consumption
Host will initiate an FLR in SDMA poison consumption scenario. Guest should wait for FLR message to re-init data exchange.
Signed-off-
drm/amdgpu: Skip virt_exchange_init on SDMA poison consumption
Host will initiate an FLR in SDMA poison consumption scenario. Guest should wait for FLR message to re-init data exchange.
Signed-off-by: Victor Skvortsov <[email protected]> Reviewed-by: Zhigang Luo <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1 |
|
| #
2474414c |
| 21-Jan-2024 |
Victor Skvortsov <[email protected]> |
drm/amdgpu: Add RAS_POISON_READY host response message
In a non-FLR page avoidance scenario, the host driver will provide the bad pages in the pf2vf exchange region.
Adding a new host response mess
drm/amdgpu: Add RAS_POISON_READY host response message
In a non-FLR page avoidance scenario, the host driver will provide the bad pages in the pf2vf exchange region.
Adding a new host response message to indicate when the pf2vf exchange region has been updated.
Signed-off-by: Victor Skvortsov <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
ed1e1e42 |
| 23-Jan-2024 |
YiPeng Chai <[email protected]> |
drm/amdgpu: Support passing poison consumption ras block to SRIOV
Support passing poison consumption ras blocks to SRIOV.
Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang
drm/amdgpu: Support passing poison consumption ras block to SRIOV
Support passing poison consumption ras blocks to SRIOV.
Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7, v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5 |
|
| #
e2515e2b |
| 02-Aug-2023 |
Ran Sun <[email protected]> |
drm/amdgpu: Clean up errors in mxgpu_nv.c
Fix the following errors reported by checkpatch:
ERROR: else should follow close brace '}' ERROR: that open brace { should be on the previous line
Signed-
drm/amdgpu: Clean up errors in mxgpu_nv.c
Fix the following errors reported by checkpatch:
ERROR: else should follow close brace '}' ERROR: that open brace { should be on the previous line
Signed-off-by: Ran Sun <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7, v6.4-rc6, v6.4-rc5, v6.4-rc4, v6.4-rc3, v6.4-rc2, v6.4-rc1, v6.3, v6.3-rc7, v6.3-rc6, v6.3-rc5, v6.3-rc4, v6.3-rc3, v6.3-rc2, v6.3-rc1, v6.2, v6.2-rc8, v6.2-rc7, v6.2-rc6, v6.2-rc5, v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1, v6.1 |
|
| #
ae844dd7 |
| 06-Dec-2022 |
Tao Zhou <[email protected]> |
drm/amdgpu: add RAS poison consumption handler for NV SRIOV
Send handling request to host.
Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-
drm/amdgpu: add RAS poison consumption handler for NV SRIOV
Send handling request to host.
Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v6.1-rc8, v6.1-rc7, v6.1-rc6, v6.1-rc5, v6.1-rc4, v6.1-rc3, v6.1-rc2, v6.1-rc1 |
|
| #
a340847b |
| 13-Oct-2022 |
Victor Zhao <[email protected]> |
Revert "drm/amdgpu: let mode2 reset fallback to default when failure"
This reverts commit dac6b80818ac2353631c5a33d140d8d5508e2957.
This commit reverted the AMDGPU_SKIP_MODE2_RESET as it conflicts
Revert "drm/amdgpu: let mode2 reset fallback to default when failure"
This reverts commit dac6b80818ac2353631c5a33d140d8d5508e2957.
This commit reverted the AMDGPU_SKIP_MODE2_RESET as it conflicts with the original design of reset handler. Will redesign it.
Fixes: dac6b80818ac23 ("drm/amdgpu: let mode2 reset fallback to default when failure") Signed-off-by: Victor Zhao <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
| #
b98a1648 |
| 13-Oct-2022 |
Victor Zhao <[email protected]> |
Revert "drm/amdgpu: let mode2 reset fallback to default when failure"
This reverts commit dac6b80818ac2353631c5a33d140d8d5508e2957.
This commit reverted the AMDGPU_SKIP_MODE2_RESET as it conflicts
Revert "drm/amdgpu: let mode2 reset fallback to default when failure"
This reverts commit dac6b80818ac2353631c5a33d140d8d5508e2957.
This commit reverted the AMDGPU_SKIP_MODE2_RESET as it conflicts with the original design of reset handler. Will redesign it.
Fixes: dac6b80818ac23 ("drm/amdgpu: let mode2 reset fallback to default when failure") Signed-off-by: Victor Zhao <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v6.0, v6.0-rc7, v6.0-rc6, v6.0-rc5, v6.0-rc4, v6.0-rc3, v6.0-rc2, v6.0-rc1, v5.19 |
|
| #
dac6b808 |
| 28-Jul-2022 |
Victor Zhao <[email protected]> |
drm/amdgpu: let mode2 reset fallback to default when failure
- introduce AMDGPU_SKIP_MODE2_RESET flag - let mode2 reset fallback to default reset method if failed
v2: move this part out from the as
drm/amdgpu: let mode2 reset fallback to default when failure
- introduce AMDGPU_SKIP_MODE2_RESET flag - let mode2 reset fallback to default reset method if failed
v2: move this part out from the asic specific part
Signed-off-by: Victor Zhao <[email protected]> Acked-by: Andrey Grodzovsky <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v5.19-rc8, v5.19-rc7, v5.19-rc6 |
|
| #
f1549c09 |
| 08-Jul-2022 |
Likun Gao <[email protected]> |
drm/amdgpu: support reset flag set for gpu reset
Move reset_context out of gpu recover function to make it configurable for different reset purpose. For the reset way of call gpu_recovery sysfs, for
drm/amdgpu: support reset flag set for gpu reset
Move reset_context out of gpu recover function to make it configurable for different reset purpose. For the reset way of call gpu_recovery sysfs, force to use full reset method. Otherwise, try soft reset by default if the related ASIC supportted, if soft reset failed, will use full reset.
Signed-off-by: Likun Gao <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v5.19-rc5, v5.19-rc4, v5.19-rc3, v5.19-rc2, v5.19-rc1, v5.18 |
|
| #
cf727044 |
| 17-May-2022 |
Andrey Grodzovsky <[email protected]> |
drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to amdgpu_device_gpu_recover
We removed the wrapper that was queueing the recover function into reset domain queue who was using this name.
Sig
drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to amdgpu_device_gpu_recover
We removed the wrapper that was queueing the recover function into reset domain queue who was using this name.
Signed-off-by: Andrey Grodzovsky <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|
|
Revision tags: v5.18-rc7, v5.18-rc6, v5.18-rc5, v5.18-rc4, v5.18-rc3, v5.18-rc2, v5.18-rc1, v5.17, v5.17-rc8, v5.17-rc7, v5.17-rc6, v5.17-rc5, v5.17-rc4, v5.17-rc3, v5.17-rc2, v5.17-rc1 |
|
| #
89a7a870 |
| 19-Jan-2022 |
Andrey Grodzovsky <[email protected]> |
drm/amdgpu: Move in_gpu_reset into reset_domain
We should have a single instance per entrire reset domain.
Signed-off-by: Andrey Grodzovsky <[email protected]> Suggested-by: Lijo Lazar <lij
drm/amdgpu: Move in_gpu_reset into reset_domain
We should have a single instance per entrire reset domain.
Signed-off-by: Andrey Grodzovsky <[email protected]> Suggested-by: Lijo Lazar <[email protected]> Reviewed-by: Christian König <[email protected]> Link: https://www.spinics.net/lists/amd-gfx/msg74116.html
show more ...
|
| #
d0fb18b5 |
| 19-Jan-2022 |
Andrey Grodzovsky <[email protected]> |
drm/amdgpu: Move reset sem into reset_domain
We want single instance of reset sem across all reset clients because in case of XGMI we should stop access cross device MMIO because any of them could b
drm/amdgpu: Move reset sem into reset_domain
We want single instance of reset sem across all reset clients because in case of XGMI we should stop access cross device MMIO because any of them could be in a reset in the moment.
Signed-off-by: Andrey Grodzovsky <[email protected]> Reviewed-by: Christian König <[email protected]> Link: https://www.spinics.net/lists/amd-gfx/msg74117.html
show more ...
|
| #
cfbb6b00 |
| 21-Jan-2022 |
Andrey Grodzovsky <[email protected]> |
drm/amdgpu: Rework reset domain to be refcounted.
The reset domain contains register access semaphor now and so needs to be present as long as each device in a hive needs it and so it cannot be bind
drm/amdgpu: Rework reset domain to be refcounted.
The reset domain contains register access semaphor now and so needs to be present as long as each device in a hive needs it and so it cannot be binded to XGMI hive life cycle. Adress this by making reset domain refcounted and pointed by each member of the hive and the hive itself.
v4:
Fix crash on boot witrh XGMI hive by adding type to reset_domain. XGMI will only create a new reset_domain if prevoius was of single device type meaning it's first boot. Otherwsie it will take a refocunt to exsiting reset_domain from the amdgou device.
Add a wrapper around reset_domain->refcount get/put and a wrapper around send to reset wq (Lijo)
Signed-off-by: Andrey Grodzovsky <[email protected]> Acked-by: Christian König <[email protected]> Link: https://www.spinics.net/lists/amd-gfx/msg74121.html
show more ...
|
|
Revision tags: v5.16, v5.16-rc8, v5.16-rc7 |
|
| #
02599bc7 |
| 20-Dec-2021 |
Andrey Grodzovsky <[email protected]> |
drm/amd/virt: For SRIOV send GPU reset directly to TDR queue.
No need to to trigger another work queue inside the work queue.
v3:
Problem: Extra reset caused by host side FLR notification followin
drm/amd/virt: For SRIOV send GPU reset directly to TDR queue.
No need to to trigger another work queue inside the work queue.
v3:
Problem: Extra reset caused by host side FLR notification following guest side triggered reset. Fix: Preven qeuing flr_work from mailbox irq if guest already executing a reset.
Suggested-by: Liu Shaoyun <[email protected]> Signed-off-by: Andrey Grodzovsky <[email protected]> Reviewed-by: Liu Shaoyun <[email protected]> Link: https://www.spinics.net/lists/amd-gfx/msg74114.html
show more ...
|
|
Revision tags: v5.16-rc6 |
|
| #
fa4a427d |
| 13-Dec-2021 |
Victor Skvortsov <[email protected]> |
drm/amdgpu: SRIOV flr_work should use down_write
Host initiated VF FLR may fail if someone else is already holding a read_lock. Change from down_write_trylock to down_write to guarantee the reset go
drm/amdgpu: SRIOV flr_work should use down_write
Host initiated VF FLR may fail if someone else is already holding a read_lock. Change from down_write_trylock to down_write to guarantee the reset goes through.
Signed-off-by: Victor Skvortsov <[email protected]> Reviewed by: Shaoyun.liu <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
show more ...
|