|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1 |
|
cc11dffc | 25-Mar-2025 | Stanley.Yang <[email protected]>
drm/amdgpu: Update ta ras block
Update the TA RAS block to keep it in sync with the RAS TA.
Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.14, v6.14-rc7, v6.14-rc6 |
|
05d50ea3 | 04-Mar-2025 | Tao Zhou <[email protected]>
drm/amdgpu: format old RAS eeprom data into V3 version
Clear old data and save it in V3 format.
v2: only format eeprom data for new ASICs.
Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.14-rc5 |
|
13c13bdd | 28-Feb-2025 | Xiang Liu <[email protected]>
drm/amdgpu: Enable ACA by default for psp v13_0_6/v13_0_14
Enable ACA by default for psp v13_0_6/v13_0_14.
Signed-off-by: Xiang Liu <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
a4b6e990 | 11-Mar-2025 | ganglxie <[email protected]>
drm/amdgpu: Save PA of bad pages for old asics
For old ASICs that do not support MCA address translation, save only the PA.
Signed-off-by: ganglxie <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
d4bd7a50 | 26-Feb-2025 | Xiang Liu <[email protected]>
drm/amdgpu: Report generic instead of unknown boot time errors
Change the DMESG reporting of unknown errors to "Boot Controller Generic Error" to align with the RAS SPEC and provide more clarity to customers.
Signed-off-by: Xiang Liu <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
a8f921a1 | 24-Feb-2025 | ganglxie <[email protected]>
drm/amdgpu: Change page/record number calculation based on nps
Save only one record to save EEPROM space, and compute bad_page_num = pa_rec_num + mca_rec_num * 16.
Signed-off-by: ganglxie <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
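The counting rule above can be illustrated with a small C sketch. It assumes, as the formula states, that a PA record stands for one page and an MCA record for 16 pages. The enum and function names are hypothetical and not taken from the driver; the two record kinds only mirror AMDGPU_RAS_EEPROM_REC_PA and AMDGPU_RAS_EEPROM_REC_MCA from commit 772df3df.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record kinds mirroring AMDGPU_RAS_EEPROM_REC_PA / _REC_MCA. */
enum ras_rec_type { REC_PA, REC_MCA };

/* Pages covered by one MCA record, per the formula in this commit. */
#define PAGES_PER_MCA_RECORD 16u

/* bad_page_num = pa_rec_num + mca_rec_num * 16 */
static uint32_t ras_bad_page_count(const enum ras_rec_type *recs, size_t n)
{
	uint32_t pa_rec_num = 0, mca_rec_num = 0;

	for (size_t i = 0; i < n; i++) {
		if (recs[i] == REC_MCA)
			mca_rec_num++;	/* one record, a whole row of pages */
		else
			pa_rec_num++;	/* one record, one page */
	}
	return pa_rec_num + mca_rec_num * PAGES_PER_MCA_RECORD;
}
```

For example, two PA records plus one MCA record give 2 + 1 * 16 = 18 bad pages while consuming only three EEPROM records.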
|
0153d276 | 24-Feb-2025 | ganglxie <[email protected]>
drm/amdgpu: Refine bad page adding
Bad page adding can be simplified using NPS info.
Signed-off-by: ganglxie <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4 |
|
2d0f5001 | 16-Dec-2024 | Thomas Weißschuh <[email protected]>
drm/amdgpu: Constify 'struct bin_attribute'
The sysfs core now allows instances of 'struct bin_attribute' to be moved into read-only memory. Make use of that to protect them against accidental or malicious modifications.
Signed-off-by: Thomas Weißschuh <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Link: https://lore.kernel.org/r/20241216-sysfs-const-bin_attr-drm-v1-4-210f2b36b9bf@weissschuh.net Signed-off-by: Greg Kroah-Hartman <[email protected]>
|
59af05d6 | 11-Feb-2025 | Candice Li <[email protected]>
drm/amdgpu: Enable ACA by default for psp v13_0_12
Enable ACA by default for psp v13_0_12.
Signed-off-by: Candice Li <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Yang Wang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
04893397 | 21-Jan-2025 | Victor Skvortsov <[email protected]>
drm/amdgpu: Skip err_count sysfs creation on VF unsupported RAS blocks
VFs cannot query error counts for all RAS blocks. Rather than returning an error for queries on these blocks, skip creating their sysfs nodes altogether.
Signed-off-by: Victor Skvortsov <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
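A minimal sketch of this idea, assuming a per-block table recording which blocks a VF may query. The block list, struct, and helper names are hypothetical, and the actual sysfs registration is replaced by a stub.

```c
#include <stdbool.h>

/* Hypothetical block ids; the driver's real RAS block list differs. */
enum ras_block { BLK_UMC, BLK_SDMA, BLK_GFX, BLK_MMHUB, BLK_COUNT };

struct ras_dev {
	bool is_vf;                   /* running as an SR-IOV virtual function */
	bool vf_can_query[BLK_COUNT]; /* blocks whose error counts a VF can read */
};

/* Stub standing in for sysfs node creation; returns 0 on success. */
static int create_err_count_node(enum ras_block blk)
{
	(void)blk;
	return 0;
}

/* Create err_count nodes, skipping blocks a VF cannot query instead of
 * creating nodes whose reads would always fail. Returns nodes created. */
static int ras_create_err_count_nodes(const struct ras_dev *dev)
{
	int created = 0;

	for (int b = 0; b < BLK_COUNT; b++) {
		if (dev->is_vf && !dev->vf_can_query[b])
			continue;
		if (create_err_count_node((enum ras_block)b) == 0)
			created++;
	}
	return created;
}
```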
|
16b85a09 | 22-Jan-2025 | Hawking Zhang <[email protected]>
drm/amdgpu: Update usage for bad page threshold
The driver's behavior varies depending on the amdgpu_bad_page_threshold setting.
Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.13-rc3 |
|
9095567b | 13-Dec-2024 | Srinivasan Shanmugam <[email protected]>
drm/amdgpu: Fix error handling in amdgpu_ras_add_bad_pages
Ensure that appropriate error codes are returned when an error condition is detected.
Fixes the following warnings:
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2849 amdgpu_ras_add_bad_pages() warn: missing error code here? 'amdgpu_umc_pages_in_a_row()' failed.
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2884 amdgpu_ras_add_bad_pages() warn: missing error code here? 'amdgpu_ras_mca2pa()' failed.
v2: s/-EIO/-EINVAL; retained the use of -EINVAL from amdgpu_umc_pages_in_a_row and amdgpu_ras_mca2pa_by_idx when the RAS context is not initialized or the convert_ras_err_addr function is unavailable. (Thomas)
V3: Returning 0 as the absence of eh_data is acceptable. (Tao)
Fixes: a8d133e625ce ("drm/amdgpu: parse legacy RAS bad page mixed with new data in various NPS modes") Reported-by: Dan Carpenter <[email protected]> Cc: YiPeng Chai <[email protected]> Cc: Tao Zhou <[email protected]> Cc: Hawking Zhang <[email protected]> Cc: Christian König <[email protected]> Cc: Alex Deucher <[email protected]> Signed-off-by: Srinivasan Shanmugam <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
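The error-handling pattern described above (propagate the helper's -EINVAL, but return 0 when eh_data is absent) can be sketched in reduced form. The structs and convert_addr() below are hypothetical stand-ins for the helpers named in the warnings, not the driver's actual code.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, much-reduced model of the failure paths. */
struct ras_ctx {
	bool initialized; /* models "RAS context is initialized" */
};

struct err_handler_data {
	size_t count; /* pages recorded so far */
};

/* Stand-in for an address-conversion helper: fails with -EINVAL when the
 * RAS context is not ready, as the v2 note describes. */
static int convert_addr(const struct ras_ctx *ctx, unsigned long mca,
			unsigned long *pa)
{
	if (!ctx || !ctx->initialized)
		return -EINVAL;
	*pa = mca; /* identity mapping purely for illustration */
	return 0;
}

static int add_bad_pages(const struct ras_ctx *ctx,
			 struct err_handler_data *eh_data,
			 const unsigned long *addrs, size_t n)
{
	if (!eh_data)
		return 0; /* v3: missing eh_data is acceptable, not an error */

	for (size_t i = 0; i < n; i++) {
		unsigned long pa;
		int ret = convert_addr(ctx, addrs[i], &pa);

		if (ret)
			return ret; /* propagate instead of silently returning 0 */
		eh_data->count++;
	}
	return 0;
}
```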
|
d1ebe307 | 16-Dec-2024 | Candice Li <[email protected]>
drm/amdgpu: Enable psp v14_0_3 RAS support for non-SRIOV configurations.
Enable psp v14_0_3 RAS support for non-SRIOV configurations.
Signed-off-by: Candice Li <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3 |
|
ecd1191e | 08-Aug-2024 | Candice Li <[email protected]>
drm/amdgpu: Support nbif v6_3_1 fatal error handling
Add nbif v6_3_1 fatal error handling support.
Signed-off-by: Candice Li <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
2c2b84f1 | 04-Dec-2024 | Candice Li <[email protected]>
drm/amdgpu: Add psp v14_0_3 ras support
Add psp v14_0_3 ras support.
Signed-off-by: Candice Li <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
9a826c4a | 18-Aug-2024 | Hawking Zhang <[email protected]>
drm/amdgpu: Enable RAS for psp v13_0_12
Enable RAS Cap check and initialize RAS funcs for psp v13_0_12
Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
ae756cd8 | 29-Nov-2024 | Tao Zhou <[email protected]>
drm/amdgpu: correct the calculation of RAS bad page
After the introduction of NPS RAS, one bad page record on EEPROM may relate to 1 or 16 bad pages, so bad page records and bad pages are two different concepts. Define a new variable to store the bad page number.
Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
1f06e7f3 | 28-Nov-2024 | Tao Zhou <[email protected]>
drm/amdgpu: split ras_eeprom_init into init and check functions
The init function reads the RAS table header, and the check function validates it. Call them at different stages.
Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
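A simplified sketch of the init/check split. The header layout, magic value, and function names are invented for illustration; the point is only that reading the header and validating it are separate steps that can run at different stages of device bring-up.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified table header; not the driver's real layout. */
struct ras_tbl_hdr {
	uint32_t magic;
	uint32_t version;
	uint32_t num_recs;
};

#define HDR_MAGIC 0x52415331u /* illustrative value only */

/* Stage 1: only read the header from a (simulated) EEPROM image. */
static int ras_eeprom_init(const uint8_t *eeprom, size_t len,
			   struct ras_tbl_hdr *hdr)
{
	if (len < sizeof(*hdr))
		return -EIO;
	memcpy(hdr, eeprom, sizeof(*hdr));
	return 0;
}

/* Stage 2: validate the header read earlier; may run at a later stage. */
static int ras_eeprom_check(const struct ras_tbl_hdr *hdr, uint32_t max_recs)
{
	if (hdr->magic != HDR_MAGIC)
		return -EINVAL;
	if (hdr->num_recs > max_recs)
		return -EINVAL;
	return 0;
}
```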
|
d08fb663 | 01-Nov-2024 | Tao Zhou <[email protected]>
drm/amdgpu: remove is_mca_add for ras_add_bad_pages
Remove unnecessary variable and simplify the logic.
Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
a8d133e6 | 31-Oct-2024 | Tao Zhou <[email protected]>
drm/amdgpu: parse legacy RAS bad page mixed with new data in various NPS modes
All legacy RAS bad pages are generated in NPS1 mode, but new bad pages can be generated in any NPS mode, so the retired_page stored on EEPROM cannot be used directly in non-NPS1 mode, even for legacy data. Different data requires different handling; new data can be distinguished from old data by the UMC_CHANNEL_IDX_V2 flag.
Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
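The decision described in this commit can be sketched as a classification step. The flag's bit position and which stored field carries it are assumptions (the macro below is a hypothetical stand-in for UMC_CHANNEL_IDX_V2), and the action names are illustrative rather than the driver's.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for UMC_CHANNEL_IDX_V2; bit position is assumed. */
#define UMC_CHANNEL_IDX_V2_SKETCH (1ULL << 47)

enum nps_mode { NPS1 = 1, NPS2 = 2, NPS4 = 4 };

enum rec_action {
	USE_STORED_PA,    /* legacy record in NPS1: stored PA usable as-is */
	RECONVERT_LEGACY, /* legacy (NPS1-era) record, device in another mode */
	CONVERT_NEW,      /* new-format record: handle via the new path */
};

static bool rec_is_new(uint64_t stored_field)
{
	return (stored_field & UMC_CHANNEL_IDX_V2_SKETCH) != 0;
}

static enum rec_action classify_record(uint64_t stored_field, enum nps_mode cur)
{
	if (rec_is_new(stored_field))
		return CONVERT_NEW;
	/* Legacy data was written in NPS1, so its PA is only valid there. */
	return cur == NPS1 ? USE_STORED_PA : RECONVERT_LEGACY;
}
```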
|
07dd49e1 | 24-Oct-2024 | Tao Zhou <[email protected]>
drm/amdgpu: support to find RAS bad pages via old TA
Old versions of the RAS TA do not support converting the MCA address stored on EEPROM to a physical address (PA). Support finding all bad pages in one memory row by PA with the old RAS TA. This approach is only suitable for NPS1 mode.
Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
c3d4acf0 | 18-Oct-2024 | Tao Zhou <[email protected]>
drm/amdgpu: store only one RAS bad page record for all pages in one row
This saves EEPROM space and stays compatible with the legacy format.
Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
e1ee2111 | 24-Oct-2024 | Lijo Lazar <[email protected]>
drm/amdgpu: Prefer RAS recovery for scheduler hang
Before scheduling a recovery due to a scheduler/job hang, check whether a RAS error has been detected. If so, choose RAS recovery to handle the situation. A scheduler/job hang could be a side effect of a RAS error; in such cases, it is required to go through the RAS error recovery process. In certain cases, the RAS error recovery process could also avoid a full device reset.
An error state is maintained in the RAS context to identify the affected block. The fatal error state uses an unused block id. Set the block id when an error is detected. If the interrupt handler has already detected a poison error, there is no need to look for a fatal error, so skip fatal error checking in such cases.
Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
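A reduced model of the recovery choice, assuming the error state is a bitmask of block ids in which the fatal state occupies an otherwise unused id. All names and id values here are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical block ids; BLK_FATAL reuses an id no real block occupies. */
enum ras_blk { BLK_UMC = 0, BLK_GFX = 1, BLK_SDMA = 2, BLK_FATAL = 31 };

/* One bit per block in which a RAS error has been detected. */
struct ras_err_state {
	unsigned int blocks;
};

static void ras_set_err(struct ras_err_state *s, enum ras_blk blk)
{
	s->blocks |= 1u << blk;
}

static bool ras_err_in(const struct ras_err_state *s, enum ras_blk blk)
{
	return (s->blocks >> blk) & 1u;
}

enum recovery { RECOVER_JOB_HANG, RECOVER_RAS };

/* On a scheduler/job hang, prefer RAS recovery if any RAS error was
 * recorded, because the hang may be a side effect of that error. */
static enum recovery pick_recovery(const struct ras_err_state *s)
{
	return s->blocks ? RECOVER_RAS : RECOVER_JOB_HANG;
}
```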
|
0eecff79 | 18-Oct-2024 | Tao Zhou <[email protected]>
drm/amdgpu: do RAS MCA2PA conversion in device init phase
With the introduction of NPS mode, the memory physical address (PA) corresponding to an MCA address varies with the NPS mode. We need to rely on the MCA address and convert it into a PA according to the NPS mode.
Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
772df3df | 18-Oct-2024 | Tao Zhou <[email protected]>
drm/amdgpu: add flag to indicate the type of RAS eeprom record
One UMC MCA address can map to multiple physical addresses (PAs):
AMDGPU_RAS_EEPROM_REC_PA: one record stores one PA
AMDGPU_RAS_EEPROM_REC_MCA: one record stores one MCA address; the PA is not considered
Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|