|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1 |
|
| #
cc11dffc |
| 25-Mar-2025 |
Stanley.Yang <[email protected]> |
drm/amdgpu: Update ta ras block
Update ta ras block to keep sync with RAS TA.
Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5 |
|
| #
d4bd7a50 |
| 26-Feb-2025 |
Xiang Liu <[email protected]> |
drm/amdgpu: Report generic instead of unknown boot time errors
Change the DMESG reporting of unknown errors to "Boot Controller Generic Error" to align with the RAS SPEC and provide more clarity to customers.
Signed-off-by: Xiang Liu <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1 |
|
| #
16b85a09 |
| 22-Jan-2025 |
Hawking Zhang <[email protected]> |
drm/amdgpu: Update usage for bad page threshold
The driver's behavior varies based on the configuration of amdgpu_bad_page_threshold setting
Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6 |
|
| #
a8d133e6 |
| 31-Oct-2024 |
Tao Zhou <[email protected]> |
drm/amdgpu: parse legacy RAS bad page mixed with new data in various NPS modes
All legacy RAS bad pages are generated in NPS1 mode, but new bad pages can be generated in any NPS mode, so we can't use retired_page stored on eeprom directly in non-NPS1 mode, even for legacy data. We need to take different actions for different data; new data can be distinguished from old data by the UMC_CHANNEL_IDX_V2 flag.
Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
| #
71a0e963 |
| 29-Oct-2024 |
Tao Zhou <[email protected]> |
drm/amdgpu: save UMC global channel index to eeprom
Save the global channel index returned by RAS TA to eeprom. We can get memory physical address by MCA address and channel index.
Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.12-rc5 |
|
| #
e1ee2111 |
| 24-Oct-2024 |
Lijo Lazar <[email protected]> |
drm/amdgpu: Prefer RAS recovery for scheduler hang
Before scheduling a recovery due to scheduler/job hang, check if a RAS error is detected. If so, choose RAS recovery to handle the situation. A scheduler/job hang could be the side effect of a RAS error. In such cases, it is required to go through the RAS error recovery process. A RAS error recovery process in certain cases could also avoid a full device reset.
An error state is maintained in RAS context to detect the block affected. Fatal Error state uses unused block id. Set the block id when error is detected. If the interrupt handler detected a poison error, it's not required to look for a fatal error. Skip fatal error checking in such cases.
Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
| #
84a2947e |
| 30-Oct-2024 |
Victor Skvortsov <[email protected]> |
drm/amdgpu: Implement virt req_ras_err_count
Enable RAS late init if VF RAS Telemetry is supported.
When enabled, the VF can use this interface to query total RAS error counts from the host.
The VF FB access may abruptly end due to a fatal error, therefore the VF must cache and sanitize the input.
The Host allows 15 Telemetry messages every 60 seconds, after which the host will ignore any further incoming telemetry messages. The VF will rate limit its msg calling to once every 5 seconds (12 times in 60 seconds). While the VF is rate limited, it will continue to report the last good cached data.
v2: Flip generate report & update statistics order for VF
Signed-off-by: Victor Skvortsov <[email protected]> Acked-by: Tao Zhou <[email protected]> Reviewed-by: Zhigang Luo <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
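The rate limiting described in this commit message can be sketched in plain C as below. This is an illustration only, not the driver's code: the struct, the function names, and the use of whole seconds as the time unit are hypothetical. Only the 5-second interval and the "keep reporting the last good cached data" behaviour come from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* One request every 5 s gives 12 per 60 s, under the host's limit of 15. */
#define TELEMETRY_MIN_INTERVAL_SEC 5

/* Hypothetical VF-side cache; not a structure from the amdgpu driver. */
struct ras_telemetry_cache {
	bool     queried;        /* true once any request has been sent */
	uint64_t last_query_sec; /* monotonic time of the last request */
	uint64_t cached_count;   /* last good error count from the host */
};

/* Returns true if the VF is allowed to send a new telemetry request. */
bool telemetry_may_query(const struct ras_telemetry_cache *c, uint64_t now_sec)
{
	return !c->queried ||
	       now_sec - c->last_query_sec >= TELEMETRY_MIN_INTERVAL_SEC;
}

/*
 * Records the outcome of a request. A failed request still consumes the
 * time slot, but the previously cached value is kept.
 */
void telemetry_record_query(struct ras_telemetry_cache *c, uint64_t now_sec,
			    bool ok, uint64_t count)
{
	c->queried = true;
	c->last_query_sec = now_sec;
	if (ok)
		c->cached_count = count;
}
```

A caller would check telemetry_may_query() first and, when it returns false, report cached_count unchanged.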
|
|
Revision tags: v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6 |
|
| #
b17f8732 |
| 30-Aug-2024 |
Lijo Lazar <[email protected]> |
drm/amdgpu: Add helper to initialize badpage info
Add a separate function to read badpage data during initialization. Reading bad pages will need hardware access and cannot be done during reset. Hence in cases where device needs a full reset during init itself, attempting to read will cause a deadlock.
Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Feifei Xu <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Acked-by: Rajneesh Bhardwaj <[email protected]> Tested-by: Rajneesh Bhardwaj <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2 |
|
| #
671af066 |
| 02-Aug-2024 |
Yang Wang <[email protected]> |
drm/amdgpu: remove RAS unused paramter 'err_addr'
- amdgpu_ras_error_statistic_ue_count() - amdgpu_ras_error_statistic_ce_count() - amdgpu_ras_error_statistic_de_count()
The parameter 'err_addr' is no longer used since following patch.
Fixes: a7e8467fbeee ("drm/amdgpu: Remove unused code") Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
| #
792be2e2 |
| 01-Aug-2024 |
Tao Zhou <[email protected]> |
drm/amdgpu: create function to check RAS RMA status
For the convenience of calling it globally.
Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
| #
dfe9d047 |
| 01-Aug-2024 |
Hawking Zhang <[email protected]> |
drm/amdgpu: Add more types for boot time error reporting
Data abort exception and unknown errors are supported.
Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.11-rc1, v6.10 |
|
| #
a7e8467f |
| 11-Jul-2024 |
YiPeng Chai <[email protected]> |
drm/amdgpu: Remove unused code
Remove unused code.
Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
| #
56631dee |
| 11-Jul-2024 |
YiPeng Chai <[email protected]> |
drm/amdgpu: optimize logging deferred error info
1. Use pa_pfn as the radix-tree key index to log deferred error info. 2. Use local array to store a row of bad pages.
Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.10-rc7 |
|
| #
59f488be |
| 03-Jul-2024 |
Yang Wang <[email protected]> |
drm/amdgpu: add ras event state device attribute support
add amdgpu ras 'event_state' sysfs device attribute support
Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.10-rc6 |
|
| #
12b435a4 |
| 28-Jun-2024 |
Yang Wang <[email protected]> |
drm/amdgpu: add ras POSION_CONSUMPTION event id support
add amdgpu ras POSION_CONSUMPTION event id support.
Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
| #
5b9de259 |
| 27-Jun-2024 |
Yang Wang <[email protected]> |
drm/amdgpu: add ras POSION_CREATION event id support
add amdgpu ras POSION_CREATION event id support.
Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
| #
75ac6a25 |
| 25-Jun-2024 |
Yang Wang <[email protected]> |
drm/amdgpu: refine amdgpu ras event id core code
v1: - use unified event id to manage ras events - add a new function amdgpu_ras_query_error_status_with_event() to accept event type as parameter.
v2: add a warn log to show the location of function failure when calling amdgpu_ras_mark_event(). (Tao Zhou)
v3: change RAS_EVENT_TYPE_ISR to RAS_EVENT_TYPE_FATAL.
v4: rename amdgpu_ras_get_recovery_event() to amdgpu_ras_get_fatal_error_event().
Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
| #
332210c1 |
| 04-Jul-2024 |
Yang Wang <[email protected]> |
drm/amdgpu: remove redundant semicolons in RAS_EVENT_LOG
remove redundant semicolons in RAS_EVENT_LOG to avoid code format check warning.
Fixes: b712d7c20133 ("drm/amdgpu: fix compiler 'side-effect' check issue for RAS_EVENT_LOG()") Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
| #
5f08275c |
| 24-Jun-2024 |
YiPeng Chai <[email protected]> |
drm/amdgpu: refine poison creation interrupt handler
To handle the case where a large number of RAS poison interrupts arrive: 1. Use a variable to record poison creation requests to avoid a full FIFO. 2. Prioritize handling poison creation requests instead of following the order of requests received by the driver.
Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
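The first point above (counting requests in a variable instead of queueing one FIFO entry per interrupt) can be illustrated with a small user-space sketch using C11 atomics. The type and function names are hypothetical, and the driver's actual data structures and locking may differ; this only shows why a counter cannot overflow the way a fixed-size FIFO can.

```c
#include <stdatomic.h>

/* Hypothetical state; not a structure from the amdgpu driver. */
struct poison_state {
	atomic_uint creation_pending; /* poison creation requests not yet handled */
};

void poison_state_init(struct poison_state *s)
{
	atomic_init(&s->creation_pending, 0);
}

/* Interrupt side: record one more request without storing a FIFO entry. */
void poison_creation_irq(struct poison_state *s)
{
	atomic_fetch_add(&s->creation_pending, 1);
}

/*
 * Worker side: claim all pending creation requests at once and reset the
 * counter. Returns how many requests were claimed.
 */
unsigned int poison_creation_take(struct poison_state *s)
{
	return atomic_exchange(&s->creation_pending, 0);
}
```

A worker that calls poison_creation_take() before draining any other request queue would give creation requests the priority described in the second point.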
|
| #
78146c1d |
| 24-Jun-2024 |
YiPeng Chai <[email protected]> |
drm/amdgpu: add variable to record the deferred error number read by driver
Add variable to record the deferred error number read by driver.
Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2 |
|
| #
7e437167 |
| 29-May-2024 |
Tao Zhou <[email protected]> |
drm/amdgpu: create amdgpu_ras_in_recovery to simplify code
Reduce redundant code and user doesn't need to pay attention to RAS details.
Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.10-rc1 |
|
| #
b95fa494 |
| 23-May-2024 |
Tao Zhou <[email protected]> |
drm/amdgpu: add RAS is_rma flag
Set the flag to true if bad page number reaches threshold.
Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
| #
a474161e |
| 30-May-2024 |
Hawking Zhang <[email protected]> |
drm/amdgpu: Update programming for boot error reporting
AMDGPU_RAS_GPU_ERR_BOOT_STATUS field is no longer valid. The polling sequence is also simplified according to the latest firmware change.
Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
| #
473af28d |
| 28-May-2024 |
Hawking Zhang <[email protected]> |
drm/amdgpu: Estimate RAS reservation when report capacity v2
Add an estimate of how much vram we need to reserve for RAS when calculating the total available vram.
v2: apply the change to MP0 v13_0_2 and v13_0_14
Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
| #
cf85764e |
| 21-May-2024 |
Hawking Zhang <[email protected]> |
drm/amdgpu: correct hbm field in boot status
The hbm field takes bit 13 and bit 14 in boot status.
Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
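As an illustration of the bit layout this commit describes, the sketch below extracts a two-bit field at bits 13 and 14 from a status word. The macro and function names are hypothetical, and a 32-bit status word is assumed for simplicity; the driver's own definition may use a different width or a bitfield struct.

```c
#include <stdint.h>

/* Hypothetical names; only the bit positions come from the commit message. */
#define BOOT_STATUS_HBM_SHIFT 13
#define BOOT_STATUS_HBM_MASK  (UINT32_C(0x3) << BOOT_STATUS_HBM_SHIFT)

/* Returns the two-bit hbm field (bits 13 and 14) as a value from 0 to 3. */
static inline uint32_t boot_status_hbm(uint32_t status)
{
	return (status & BOOT_STATUS_HBM_MASK) >> BOOT_STATUS_HBM_SHIFT;
}
```

With this layout, a status word that has only bit 14 set yields a field value of 2.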
|