History log of /linux-6.15/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c (Results 1 – 25 of 497)
Revision Date Author Comments
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1
# cc11dffc 25-Mar-2025 Stanley.Yang <[email protected]>

drm/amdgpu: Update ta ras block

Update ta ras block to keep in sync with RAS TA.

Signed-off-by: Stanley.Yang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

Revision tags: v6.14, v6.14-rc7, v6.14-rc6
# 05d50ea3 04-Mar-2025 Tao Zhou <[email protected]>

drm/amdgpu: format old RAS eeprom data into V3 version

Clear old data and save it in V3 format.

v2: only format eeprom data for new ASICs.

Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

Revision tags: v6.14-rc5
# 13c13bdd 28-Feb-2025 Xiang Liu <[email protected]>

drm/amdgpu: Enable ACA by default for psp v13_0_6/v13_0_14

Enable ACA by default for psp v13_0_6/v13_0_14.

Signed-off-by: Xiang Liu <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

# a4b6e990 11-Mar-2025 ganglxie <[email protected]>

drm/amdgpu: Save PA of bad pages for old asics

For old ASICs that do not support MCA translation, we
just save the PA for them.

Signed-off-by: ganglxie <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

# d4bd7a50 26-Feb-2025 Xiang Liu <[email protected]>

drm/amdgpu: Report generic instead of unknown boot time errors

Change the DMESG reporting of unknown errors to "Boot Controller
Generic Error" to align with the RAS SPEC and provide more clarity
to customers.

Signed-off-by: Xiang Liu <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

# a8f921a1 24-Feb-2025 ganglxie <[email protected]>

drm/amdgpu: Change page/record number calculation based on nps

save only one record to save eeprom space, and
bad_page_num = pa_rec_num + mca_rec_num*16

Signed-off-by: ganglxie <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
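
The formula quoted in this commit message can be illustrated with a small standalone sketch. The function name and the macro name below are hypothetical and are not taken from the driver; only the factor of 16 and the two record counters come from the message itself.

```c
#include <stdint.h>

/* Hypothetical name: pages covered by one MCA record, taken from the
 * "mca_rec_num*16" term in the commit message. */
#define PAGES_PER_MCA_RECORD 16u

/* Illustrative helper (not the driver's function): a PA record counts
 * as one bad page, an MCA record counts as 16 bad pages. */
static uint32_t bad_page_count(uint32_t pa_rec_num, uint32_t mca_rec_num)
{
	return pa_rec_num + mca_rec_num * PAGES_PER_MCA_RECORD;
}
```

For instance, under these assumptions 3 PA records plus 2 MCA records would account for 3 + 2 * 16 = 35 bad pages.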

# 0153d276 24-Feb-2025 ganglxie <[email protected]>

drm/amdgpu: Refine bad page adding

bad page adding can be simpler with nps info

Signed-off-by: ganglxie <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

Revision tags: v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4
# 2d0f5001 16-Dec-2024 Thomas Weißschuh <[email protected]>

drm/amdgpu: Constify 'struct bin_attribute'

The sysfs core now allows instances of 'struct bin_attribute' to be
moved into read-only memory. Make use of that to protect them against
accidental or malicious modifications.

Signed-off-by: Thomas Weißschuh <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Link: https://lore.kernel.org/r/20241216-sysfs-const-bin_attr-drm-v1-4-210f2b36b9bf@weissschuh.net
Signed-off-by: Greg Kroah-Hartman <[email protected]>

# 59af05d6 11-Feb-2025 Candice Li <[email protected]>

drm/amdgpu: Enable ACA by default for psp v13_0_12

Enable ACA by default for psp v13_0_12.

Signed-off-by: Candice Li <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Reviewed-by: Yang Wang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

# 04893397 21-Jan-2025 Victor Skvortsov <[email protected]>

drm/amdgpu: Skip err_count sysfs creation on VF unsupported RAS blocks

VFs are not able to query error counts for all RAS blocks. Rather than
returning an error for queries on these blocks, skip the sysfs creation
altogether.

Signed-off-by: Victor Skvortsov <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

# 16b85a09 22-Jan-2025 Hawking Zhang <[email protected]>

drm/amdgpu: Update usage for bad page threshold

The driver's behavior varies based on
the configuration of amdgpu_bad_page_threshold setting

Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

Revision tags: v6.13-rc3
# 9095567b 13-Dec-2024 Srinivasan Shanmugam <[email protected]>

drm/amdgpu: Fix error handling in amdgpu_ras_add_bad_pages

It ensures that appropriate error codes are returned when an error
condition is detected

Fixes the below:
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2849 amdgpu_ras_add_bad_pages() warn: missing error code here? 'amdgpu_umc_pages_in_a_row()' failed.
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2884 amdgpu_ras_add_bad_pages() warn: missing error code here? 'amdgpu_ras_mca2pa()' failed.

v2: s/-EIO/-EINVAL, retained the use of -EINVAL from
amdgpu_umc_pages_in_a_row and amdgpu_ras_mca2pa_by_idx, when the
RAS context is not initialized or the convert_ras_err_addr function is
unavailable. (Thomas)

V3: Returning 0 as the absence of eh_data is acceptable. (Tao)

Fixes: a8d133e625ce ("drm/amdgpu: parse legacy RAS bad page mixed with new data in various NPS modes")
Reported-by: Dan Carpenter <[email protected]>
Cc: YiPeng Chai <[email protected]>
Cc: Tao Zhou <[email protected]>
Cc: Hawking Zhang <[email protected]>
Cc: Christian König <[email protected]>
Cc: Alex Deucher <[email protected]>
Signed-off-by: Srinivasan Shanmugam <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
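
The error-handling pattern this message describes can be sketched in isolation. Everything below is a hypothetical stand-in, not driver code: the struct, the helper, and the function names are invented for illustration. The sketch only mirrors the two rules stated in the message: a failing helper's error code (here -EINVAL) is propagated instead of being dropped, and a missing handler-data pointer returns 0 because its absence is acceptable.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-device error handler data. */
struct eh_data_sketch {
	int count;
};

/* Hypothetical stand-in for a conversion helper that can fail. */
static int convert_address_sketch(int can_convert)
{
	return can_convert ? 0 : -EINVAL;
}

/* Illustrative flow: return 0 when there is no handler data, and
 * propagate the helper's error code when the conversion fails. */
static int add_bad_pages_sketch(struct eh_data_sketch *data, int can_convert)
{
	int ret;

	if (!data)
		return 0;	/* absence of handler data is acceptable */

	ret = convert_address_sketch(can_convert);
	if (ret)
		goto out;	/* do not fall through with ret == 0 */

	data->count++;
out:
	return ret;
}
```

The single exit label keeps any future cleanup in one place, which is the usual kernel idiom for this kind of fix.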

# d1ebe307 16-Dec-2024 Candice Li <[email protected]>

drm/amdgpu: Enable psp v14_0_3 RAS support for non-SRIOV configurations.

Enable psp v14_0_3 RAS support for non-SRIOV configurations.

Signed-off-by: Candice Li <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

Revision tags: v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3
# ecd1191e 08-Aug-2024 Candice Li <[email protected]>

drm/amdgpu: Support nbif v6_3_1 fatal error handling

Add nbif v6_3_1 fatal error handling support.

Signed-off-by: Candice Li <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

# 2c2b84f1 04-Dec-2024 Candice Li <[email protected]>

drm/amdgpu: Add psp v14_0_3 ras support

Add psp v14_0_3 ras support.

Signed-off-by: Candice Li <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

# 9a826c4a 18-Aug-2024 Hawking Zhang <[email protected]>

drm/amdgpu: Enable RAS for psp v13_0_12

Enable RAS Cap check and initialize RAS funcs
for psp v13_0_12

Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

# ae756cd8 29-Nov-2024 Tao Zhou <[email protected]>

drm/amdgpu: correct the calculation of RAS bad page

After the introduction of NPS RAS, one bad page record on eeprom may be
related to 1 or 16 bad pages, so the bad page record and bad page are
two different concepts, define a new variable to store bad page number.

Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

# 1f06e7f3 28-Nov-2024 Tao Zhou <[email protected]>

drm/amdgpu: split ras_eeprom_init into init and check functions

Init function is for ras table header read and check function is
responsible for the validation of the header. Call them in different
stages.

Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

# d08fb663 01-Nov-2024 Tao Zhou <[email protected]>

drm/amdgpu: remove is_mca_add for ras_add_bad_pages

Remove unnecessary variable and simplify the logic.

Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

# a8d133e6 31-Oct-2024 Tao Zhou <[email protected]>

drm/amdgpu: parse legacy RAS bad page mixed with new data in various NPS modes

All legacy RAS bad pages are generated in NPS1 mode, but new bad page
can be generated in any NPS mode, so we can't use retired_page stored
on eeprom directly in non-nps1 mode even for legacy data. We need to
take different actions for different data, new data can be identified
from old data by UMC_CHANNEL_IDX_V2 flag.

Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
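
The flag-based distinction between legacy and new records can be sketched as follows. This is an assumption-laden illustration: only the flag name UMC_CHANNEL_IDX_V2 comes from the commit message, while the bit position, the field it is stored in, and the helper name are hypothetical.

```c
#include <stdint.h>

/* ASSUMPTION: the bit position is illustrative only; the commit message
 * names the flag but does not state its value. */
#define UMC_CHANNEL_IDX_V2 (1u << 31)

enum record_origin_sketch {
	RECORD_LEGACY,	/* generated in NPS1 mode by older code */
	RECORD_NEW	/* may have been generated in any NPS mode */
};

/* Hypothetical helper: classify a record by testing the flag in a
 * channel-index style field. */
static enum record_origin_sketch classify_record(uint32_t channel_field)
{
	return (channel_field & UMC_CHANNEL_IDX_V2) ? RECORD_NEW
						    : RECORD_LEGACY;
}
```

Once a record is classified, the caller can take a different action for each kind, as the message describes.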

# 07dd49e1 24-Oct-2024 Tao Zhou <[email protected]>

drm/amdgpu: support to find RAS bad pages via old TA

Old versions of the RAS TA don't support converting an MCA address stored on
eeprom to a physical address (PA). Add support for finding all bad pages in one
memory row by PA with the old RAS TA. This approach is only suitable for
nps1 mode.

Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

# c3d4acf0 18-Oct-2024 Tao Zhou <[email protected]>

drm/amdgpu: store only one RAS bad page record for all pages in one row

So eeprom space can be saved, compatible with legacy way.

Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

# e1ee2111 24-Oct-2024 Lijo Lazar <[email protected]>

drm/amdgpu: Prefer RAS recovery for scheduler hang

Before scheduling a recovery due to scheduler/job hang, check if a RAS
error is detected. If so, choose RAS recovery to handle the situation. A
scheduler/job hang could be the side effect of a RAS error. In such
cases, it is required to go through the RAS error recovery process. A
RAS error recovery process in certain cases could also avoid a full
device reset.

An error state is maintained in RAS context to detect the block
affected. Fatal Error state uses unused block id. Set the block id when
error is detected. If the interrupt handler detected a poison error,
it's not required to look for a fatal error. Skip fatal error checking
in such cases.

Signed-off-by: Lijo Lazar <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
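
The decision described in this message can be sketched as a small standalone function. All names here are hypothetical and do not come from the driver. The sketch assumes a simplified state with two booleans and shows only the ordering the message states: a detected poison error selects RAS recovery without querying the fatal-error state, and otherwise a fatal error selects RAS recovery over the ordinary scheduler-hang recovery.

```c
#include <stdbool.h>

enum recovery_path_sketch {
	RECOVERY_SCHED_RESET,	/* ordinary scheduler/job hang recovery */
	RECOVERY_RAS		/* RAS error recovery process */
};

/* Hypothetical, simplified error state kept in a RAS context. */
struct ras_err_sketch {
	bool poison_detected;
	bool fatal_pending;
	int fatal_checks;	/* how many times the fatal state was queried */
};

/* Hypothetical query; the counter only makes the skipped check visible. */
static bool query_fatal_sketch(struct ras_err_sketch *s)
{
	s->fatal_checks++;
	return s->fatal_pending;
}

static enum recovery_path_sketch choose_recovery(struct ras_err_sketch *s)
{
	if (s->poison_detected)
		return RECOVERY_RAS;	/* fatal check is skipped */
	if (query_fatal_sketch(s))
		return RECOVERY_RAS;
	return RECOVERY_SCHED_RESET;
}
```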

# 0eecff79 18-Oct-2024 Tao Zhou <[email protected]>

drm/amdgpu: do RAS MCA2PA conversion in device init phase

With the introduction of NPS mode, the memory physical address (PA)
related to an MCA address varies per NPS mode. We need to rely on
the MCA address and convert it into a PA according to the NPS mode.

Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

# 772df3df 18-Oct-2024 Tao Zhou <[email protected]>

drm/amdgpu: add flag to indicate the type of RAS eeprom record

One UMC MCA address could map to multiple physical addresses (PA):

AMDGPU_RAS_EEPROM_REC_PA: one record stores one PA
AMDGPU_RAS_EEPROM_REC_MCA: one record stores one MCA address, PA
is not cared about

Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
