History log of /linux-6.15/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h (Results 1 – 25 of 146)
Revision Date Author Comments
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1
# cc11dffc 25-Mar-2025 Stanley.Yang <[email protected]>

drm/amdgpu: Update ta ras block

Update ta ras block to keep sync with RAS TA.

Signed-off-by: Stanley.Yang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

Revision tags: v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5
# d4bd7a50 26-Feb-2025 Xiang Liu <[email protected]>

drm/amdgpu: Report generic instead of unknown boot time errors

Change the DMESG reporting of unknown errors to "Boot Controller
Generic Error" to align with the RAS SPEC and provide more clarity
to customers.

Signed-off-by: Xiang Liu <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

Revision tags: v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1
# 16b85a09 22-Jan-2025 Hawking Zhang <[email protected]>

drm/amdgpu: Update usage for bad page threshold

The driver's behavior varies based on
the configuration of the amdgpu_bad_page_threshold setting.

Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

Revision tags: v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6
# a8d133e6 31-Oct-2024 Tao Zhou <[email protected]>

drm/amdgpu: parse legacy RAS bad page mixed with new data in various NPS modes

All legacy RAS bad pages are generated in NPS1 mode, but new bad page
can be generated in any NPS mode, so we can't use retired_page stored
on eeprom directly in non-nps1 mode even for legacy data. We need to
take different actions for different data, new data can be identified
from old data by UMC_CHANNEL_IDX_V2 flag.

Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

# 71a0e963 29-Oct-2024 Tao Zhou <[email protected]>

drm/amdgpu: save UMC global channel index to eeprom

Save the global channel index returned by RAS TA to eeprom.
We can get memory physical address by MCA address and channel index.

Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

Revision tags: v6.12-rc5
# e1ee2111 24-Oct-2024 Lijo Lazar <[email protected]>

drm/amdgpu: Prefer RAS recovery for scheduler hang

Before scheduling a recovery due to scheduler/job hang, check if a RAS
error is detected. If so, choose RAS recovery to handle the situation. A
scheduler/job hang could be the side effect of a RAS error. In such
cases, it is required to go through the RAS error recovery process. A
RAS error recovery process in certain cases could also avoid a full
device reset.

An error state is maintained in RAS context to detect the block
affected. Fatal Error state uses unused block id. Set the block id when
error is detected. If the interrupt handler detected a poison error,
it's not required to look for a fatal error. Skip fatal error checking
in such cases.

Signed-off-by: Lijo Lazar <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

# 84a2947e 30-Oct-2024 Victor Skvortsov <[email protected]>

drm/amdgpu: Implement virt req_ras_err_count

Enable RAS late init if VF RAS Telemetry is supported.

When enabled, the VF can use this interface to query total
RAS error counts from the host.

The VF FB access may abruptly end due to a fatal error,
therefore the VF must cache and sanitize the input.

The Host allows 15 Telemetry messages every 60 seconds, after which
the host will ignore any more in-coming telemetry messages. The VF will
rate limit its msg calling to once every 5 seconds (12 times in 60 seconds).
While the VF is rate limited, it will continue to report the last
good cached data.

v2: Flip generate report & update statistics order for VF

Signed-off-by: Victor Skvortsov <[email protected]>
Acked-by: Tao Zhou <[email protected]>
Reviewed-by: Zhigang Luo <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
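
The rate limiting described in this commit message can be illustrated with a minimal sketch. This is not the driver's code: `telemetry_cache`, `telemetry_get_count`, `host_query_fn`, and `fake_host_query` are hypothetical names invented for illustration. Only the 5-second interval (12 requests per 60 seconds, under the host's limit of 15) comes from the message above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constant: one request every 5 s gives 12 per 60 s. */
#define TELEMETRY_MIN_INTERVAL_MS 5000u

/* Hypothetical cache; not the amdgpu driver's actual structure. */
struct telemetry_cache {
    uint64_t last_query_ms; /* time the last request was sent to the host */
    bool     queried;       /* true once at least one request was sent */
    uint64_t cached_count;  /* last good error count returned by the host */
};

/* Stand-in for the host query; the signature is an assumption. */
typedef bool (*host_query_fn)(uint64_t *count_out);

uint64_t telemetry_get_count(struct telemetry_cache *c, uint64_t now_ms,
                             host_query_fn query)
{
    uint64_t fresh;

    assert(c != NULL && query != NULL);

    /* While rate limited, keep reporting the last good cached value. */
    if (c->queried && now_ms - c->last_query_ms < TELEMETRY_MIN_INTERVAL_MS)
        return c->cached_count;

    c->queried = true;
    c->last_query_ms = now_ms;

    /* Only overwrite the cache when the host reply is good. */
    if (query(&fresh))
        c->cached_count = fresh;

    return c->cached_count;
}

/* Toy host used for illustration: the count grows on each real query. */
static uint64_t fake_host_calls;

static bool fake_host_query(uint64_t *count_out)
{
    *count_out = ++fake_host_calls * 10;
    return true;
}
```

Calling `telemetry_get_count()` twice within 5 seconds sends a single request to the toy host; the second call is served from the cache.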

Revision tags: v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6
# b17f8732 30-Aug-2024 Lijo Lazar <[email protected]>

drm/amdgpu: Add helper to initialize badpage info

Add a separate function to read badpage data during initialization.
Reading bad pages will need hardware access and cannot be done during
reset. Hence in cases where device needs a full reset during
init itself, attempting to read will cause a deadlock.

Signed-off-by: Lijo Lazar <[email protected]>
Reviewed-by: Feifei Xu <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Acked-by: Rajneesh Bhardwaj <[email protected]>
Tested-by: Rajneesh Bhardwaj <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

Revision tags: v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2
# 671af066 02-Aug-2024 Yang Wang <[email protected]>

drm/amdgpu: remove RAS unused paramter 'err_addr'

- amdgpu_ras_error_statistic_ue_count()
- amdgpu_ras_error_statistic_ce_count()
- amdgpu_ras_error_statistic_de_count()

The parameter 'err_addr' is no longer used since following patch.

Fixes: a7e8467fbeee ("drm/amdgpu: Remove unused code")
Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

# 792be2e2 01-Aug-2024 Tao Zhou <[email protected]>

drm/amdgpu: create function to check RAS RMA status

For the convenience of calling it globally.

Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

# dfe9d047 01-Aug-2024 Hawking Zhang <[email protected]>

drm/amdgpu: Add more types for boot time error reporting

Data abort exception and unknown errors are supported.

Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

Revision tags: v6.11-rc1, v6.10
# a7e8467f 11-Jul-2024 YiPeng Chai <[email protected]>

drm/amdgpu: Remove unused code

Remove unused code.

Signed-off-by: YiPeng Chai <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

# 56631dee 11-Jul-2024 YiPeng Chai <[email protected]>

drm/amdgpu: optimize logging deferred error info

1. Use pa_pfn as the radix-tree key index to log
deferred error info.
2. Use local array to store a row of bad pages.

Signed-off-by: YiPeng Chai <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

Revision tags: v6.10-rc7
# 59f488be 03-Jul-2024 Yang Wang <[email protected]>

drm/amdgpu: add ras event state device attribute support

add amdgpu ras 'event_state' sysfs device attribute support

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

Revision tags: v6.10-rc6
# 12b435a4 28-Jun-2024 Yang Wang <[email protected]>

drm/amdgpu: add ras POSION_CONSUMPTION event id support

add amdgpu ras POSION_CONSUMPTION event id support.

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

# 5b9de259 27-Jun-2024 Yang Wang <[email protected]>

drm/amdgpu: add ras POSION_CREATION event id support

add amdgpu ras POSION_CREATION event id support.

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

# 75ac6a25 25-Jun-2024 Yang Wang <[email protected]>

drm/amdgpu: refine amdgpu ras event id core code

v1:
- use unified event id to manage ras events
- add a new function amdgpu_ras_query_error_status_with_event() to accept
event type as parameter.

v2:
add a warn log to show the location of function failure
when calling amdgpu_ras_mark_event(). (Tao Zhou)

v3:
change RAS_EVENT_TYPE_ISR to RAS_EVENT_TYPE_FATAL.

v4:
rename amdgpu_ras_get_recovery_event() to
amdgpu_ras_get_fatal_error_event().

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

# 332210c1 04-Jul-2024 Yang Wang <[email protected]>

drm/amdgpu: remove redundant semicolons in RAS_EVENT_LOG

remove redundant semicolons in RAS_EVENT_LOG to avoid
code format check warning.

Fixes: b712d7c20133 ("drm/amdgpu: fix compiler 'side-effect' check issue for RAS_EVENT_LOG()")
Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

# 5f08275c 24-Jun-2024 YiPeng Chai <[email protected]>

drm/amdgpu: refine poison creation interrupt handler

To handle the case where a large number
of RAS poison interrupts arrive:
1. Use a variable to record poison creation
requests to avoid filling the FIFO.
2. Prioritize handling poison creation requests
instead of following the order in which requests
are received by the driver.

Signed-off-by: YiPeng Chai <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

# 78146c1d 24-Jun-2024 YiPeng Chai <[email protected]>

drm/amdgpu: add variable to record the deferred error number read by driver

Add variable to record the deferred error
number read by driver.

Signed-off-by: YiPeng Chai <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

Revision tags: v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2
# 7e437167 29-May-2024 Tao Zhou <[email protected]>

drm/amdgpu: create amdgpu_ras_in_recovery to simplify code

Reduce redundant code so that callers don't need to pay attention to RAS
details.

Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

Revision tags: v6.10-rc1
# b95fa494 23-May-2024 Tao Zhou <[email protected]>

drm/amdgpu: add RAS is_rma flag

Set the flag to true if bad page number reaches threshold.

Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
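
As a rough illustration of the behavior described above, the following sketch sets an RMA flag once the bad page count reaches a threshold. The struct and function names (`bad_page_tracker`, `bad_page_record`) are hypothetical and are not taken from amdgpu_ras.h; only the idea of a flag set at the threshold comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tracker; field names are assumptions for illustration. */
struct bad_page_tracker {
    uint32_t count;     /* bad pages recorded so far */
    uint32_t threshold; /* configured bad page threshold */
    bool     is_rma;    /* set once count reaches threshold */
};

/* Record newly retired pages and latch is_rma at the threshold. */
void bad_page_record(struct bad_page_tracker *t, uint32_t new_pages)
{
    assert(t != NULL);

    t->count += new_pages;
    if (t->count >= t->threshold)
        t->is_rma = true;
}
```

The flag is latched: once set, this sketch never clears it, which matches the idea of a device being marked for RMA.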

# a474161e 30-May-2024 Hawking Zhang <[email protected]>

drm/amdgpu: Update programming for boot error reporting

AMDGPU_RAS_GPU_ERR_BOOT_STATUS field is no longer valid.
The polling sequence is also simplified according to
the latest firmware change.

Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

# 473af28d 28-May-2024 Hawking Zhang <[email protected]>

drm/amdgpu: Estimate RAS reservation when report capacity v2

Add estimate of how much vram we need to reserve for RAS
when calculating the total available vram.

v2: apply the change to MP0 v13_0_2 and v13_0_14

Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

# cf85764e 21-May-2024 Hawking Zhang <[email protected]>

drm/amdgpu: correct hbm field in boot status

hbm field takes bit 13 and bit 14 in boot status.

Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
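
A two-bit field at bits 13 and 14 can be extracted with a shift and mask, as in the sketch below. The macro and function names (`BOOT_STATUS_HBM_SHIFT`, `BOOT_STATUS_HBM_WIDTH`, `get_bits`, `boot_status_hbm`) are hypothetical; only the bit positions come from the commit message, and the driver's real definition may use a different representation.

```c
#include <assert.h>
#include <stdint.h>

/* Extract `width` bits starting at bit `shift` from a 32-bit value. */
static inline uint32_t get_bits(uint32_t value, unsigned shift, unsigned width)
{
    assert(width > 0 && width < 32 && shift + width <= 32);
    return (value >> shift) & ((1u << width) - 1u);
}

/* Hypothetical names; the field occupies bits 13 and 14. */
#define BOOT_STATUS_HBM_SHIFT 13u
#define BOOT_STATUS_HBM_WIDTH 2u

static inline uint32_t boot_status_hbm(uint32_t boot_status)
{
    return get_bits(boot_status, BOOT_STATUS_HBM_SHIFT, BOOT_STATUS_HBM_WIDTH);
}
```

With this layout, a boot status of `0x6000` has both field bits set, and all bits outside 13 and 14 are ignored.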
