|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14 |
|
| #
aedc92be |
| 24-Mar-2025 |
Xiang Liu <[email protected]> |
drm/amdgpu: Parse all deferred errors with UMC aca handle
Deferred errors should be counted only in the UMC block.
Signed-off-by: Xiang Liu <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
| #
338f7412 |
| 19-Mar-2025 |
Xiang Liu <[email protected]> |
drm/amdgpu: Decode deferred error type in gfx aca bank parser
When an uncorrected error is injected with a background workload, the deferred errors among the uncorrected errors need to be identified by checking the deferred and poison bits of the status register.
v2: refine checking for deferred error
v2: log possible DEs among CEs
v2: generate CPER records for DEs among UEs
Signed-off-by: Xiang Liu <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.14-rc7, v6.14-rc6, v6.14-rc5 |
|
| #
00f85667 |
| 26-Feb-2025 |
Xiang Liu <[email protected]> |
drm/amdgpu: Decode deferred error type in aca bank parser
In the case of a poison in-band log, the error type needs to be determined by checking the deferred or poison bit of the status register.
v2: check both the deferred and poison bits
Signed-off-by: Xiang Liu <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
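The bit check described in this commit message can be sketched in plain C as follows. This is only an illustration, not the driver's code: the names `demo_status_bit`, `demo_decode_err_type` and the `DEMO_*` constants are hypothetical, the bit positions (43 for poison, 44 for deferred) are an assumption based on the AMD MCA_STATUS register layout, and treating a bank as deferred when either bit is set is likewise an assumed reading of "check both deferred and poison bit".

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions in the 64-bit status register (hypothetical names). */
#define DEMO_STATUS_POISON_BIT   43u
#define DEMO_STATUS_DEFERRED_BIT 44u

/* Hypothetical error classes used only by this sketch. */
enum demo_err_type {
    DEMO_ERR_TYPE_UE,       /* uncorrected error */
    DEMO_ERR_TYPE_DEFERRED  /* deferred error */
};

/* Return true when the given bit of the status value is set. */
static bool demo_status_bit(uint64_t status, unsigned int bit)
{
    return ((status >> bit) & 1ULL) != 0;
}

/*
 * Classify a bank reported as uncorrected: if the deferred or the poison
 * bit is set, report it as deferred; otherwise keep it as uncorrected.
 * The "either bit" rule is an assumption made for this sketch.
 */
static enum demo_err_type demo_decode_err_type(uint64_t status)
{
    if (demo_status_bit(status, DEMO_STATUS_DEFERRED_BIT) ||
        demo_status_bit(status, DEMO_STATUS_POISON_BIT))
        return DEMO_ERR_TYPE_DEFERRED;

    return DEMO_ERR_TYPE_UE;
}
```

For example, `demo_decode_err_type(1ULL << 44)` yields `DEMO_ERR_TYPE_DEFERRED`, while a status value with neither bit set stays `DEMO_ERR_TYPE_UE`.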
|
|
Revision tags: v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1 |
|
| #
ad97840f |
| 26-Jan-2025 |
Hawking Zhang <[email protected]> |
drm/amdgpu: Introduce funcs for generating cper record
Introduce new functions for generating CPER UE or CE records.
v2: return -ENOMEM instead of false
v2: check return value of fill section function
Signed-off-by: Hawking Zhang <[email protected]> Signed-off-by: Xiang Liu <[email protected]> Reviewed-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
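The two v2 notes in this commit message (return a negative errno instead of a boolean, and check the return value of the section-fill step) follow a common C error-handling pattern. A minimal userspace sketch is shown below; `struct demo_record`, `demo_fill_section` and `demo_generate_record` are hypothetical names invented for illustration and do not reflect the driver's actual CPER structures or functions.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record type; not the driver's real CPER layout. */
struct demo_record {
    size_t len;
    unsigned char *payload;
};

/*
 * Hypothetical section filler. Returns 0 on success or a negative errno,
 * so the caller can tell an allocation failure from bad arguments.
 */
static int demo_fill_section(struct demo_record *rec, const void *data, size_t len)
{
    if (!rec || !data || len == 0)
        return -EINVAL;

    rec->payload = malloc(len);
    if (!rec->payload)
        return -ENOMEM; /* a negative errno rather than "false" */

    memcpy(rec->payload, data, len);
    rec->len = len;
    return 0;
}

/* Hypothetical generator: checks the fill step and propagates its error. */
static int demo_generate_record(struct demo_record *rec, const void *data, size_t len)
{
    int ret = demo_fill_section(rec, data, len);

    if (ret)
        return ret;

    return 0;
}
```

A caller would test the result with `if (ret) ...` and release `rec->payload` with `free()` once the record is no longer needed.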
|
| #
56316ee9 |
| 26-Jan-2025 |
Hawking Zhang <[email protected]> |
drm/amdgpu: Include ACA error type in aca bank
ACA error types managed by the driver do not have a direct 1:1 correspondence with those managed by firmware.
To address this, for each ACA bank, include both the ACA error type and the ACA SMU type.
This addition is useful for creating CPER records.
Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1 |
|
| #
abfcf956 |
| 29-Nov-2024 |
Yang Wang <[email protected]> |
drm/amdgpu: move common ACA ipid defines into amdgpu_aca.h
move common ACA ipid defines into amdgpu_aca.h file.
Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5 |
|
| #
a4fcb5f7 |
| 18-Jun-2024 |
Yang Wang <[email protected]> |
Revert "drm/amdgpu: change aca bank error lock type to spinlock"
This reverts commit f6bce954f432c556659a57be9e18fecdc575affb.
Revert this patch to change the lock type back to 'mutex' and avoid the kernel call trace below.
[ 602.668806] Workqueue: amdgpu-reset-dev amdgpu_ras_do_recovery [amdgpu]
[ 602.668939] Call Trace:
[ 602.668940] <TASK>
[ 602.668941] dump_stack_lvl+0x4c/0x70
[ 602.668945] dump_stack+0x14/0x20
[ 602.668946] __schedule_bug+0x5a/0x70
[ 602.668950] __schedule+0x940/0xb30
[ 602.668952] ? srso_alias_return_thunk+0x5/0xfbef5
[ 602.668955] ? hrtimer_reprogram+0x77/0xb0
[ 602.668957] ? srso_alias_return_thunk+0x5/0xfbef5
[ 602.668959] ? hrtimer_start_range_ns+0x126/0x370
[ 602.668961] schedule+0x39/0xe0
[ 602.668962] schedule_hrtimeout_range_clock+0xb1/0x140
[ 602.668964] ? __pfx_hrtimer_wakeup+0x10/0x10
[ 602.668966] schedule_hrtimeout_range+0x17/0x20
[ 602.668967] usleep_range_state+0x69/0x90
[ 602.668970] psp_cmd_submit_buf+0x132/0x570 [amdgpu]
[ 602.669066] psp_ras_invoke+0x75/0x1a0 [amdgpu]
[ 602.669156] psp_ras_query_address+0x9c/0x120 [amdgpu]
[ 602.669245] umc_v12_0_update_ecc_status+0x16d/0x520 [amdgpu]
[ 602.669337] ? srso_alias_return_thunk+0x5/0xfbef5
[ 602.669339] ? stack_depot_save+0x12/0x20
[ 602.669342] ? srso_alias_return_thunk+0x5/0xfbef5
[ 602.669343] ? set_track_prepare+0x52/0x70
[ 602.669346] ? kmemleak_alloc+0x4f/0x90
[ 602.669348] ? __kmalloc_node+0x34b/0x450
[ 602.669352] amdgpu_umc_update_ecc_status+0x23/0x40 [amdgpu]
[ 602.669438] mca_umc_mca_get_err_count+0x85/0xc0 [amdgpu]
[ 602.669554] mca_smu_parse_mca_error_count+0x120/0x1d0 [amdgpu]
[ 602.669655] amdgpu_mca_dispatch_mca_set.part.0+0x141/0x250 [amdgpu]
[ 602.669743] ? kmemleak_free+0x36/0x60
[ 602.669745] ? kvfree+0x32/0x40
[ 602.669747] ? srso_alias_return_thunk+0x5/0xfbef5
[ 602.669749] ? kfree+0x15d/0x2a0
[ 602.669752] amdgpu_mca_smu_log_ras_error+0x1f6/0x210 [amdgpu]
[ 602.669839] amdgpu_ras_query_error_status_helper+0x2ad/0x390 [amdgpu]
[ 602.669924] ? srso_alias_return_thunk+0x5/0xfbef5
[ 602.669925] ? __call_rcu_common.constprop.0+0xa6/0x2b0
[ 602.669929] amdgpu_ras_query_error_status+0xf3/0x620 [amdgpu]
[ 602.670014] ? srso_alias_return_thunk+0x5/0xfbef5
[ 602.670017] amdgpu_ras_log_on_err_counter+0xe1/0x170 [amdgpu]
[ 602.670103] amdgpu_ras_do_recovery+0xd2/0x2c0 [amdgpu]
[ 602.670187] ? srso_alias_return_thunk+0x5/0
Signed-off-by: Yang Wang <[email protected]> Reviewed-by: YiPeng Chai <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.10-rc4, v6.10-rc3 |
|
| #
9817f061 |
| 04-Jun-2024 |
Yang Wang <[email protected]> |
drm/amdgpu: move aca/mca init functions into ras_init() stage
Move the aca/mca init calls so that they better mirror the aca/mca fini code in ras_fini().
Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.10-rc2, v6.10-rc1 |
|
| #
062a7ce6 |
| 17-May-2024 |
Yang Wang <[email protected]> |
drm/amdgpu: fix ACA no query result after gpu reset
Fix ACA queries returning no result after a GPU reset.
Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
| #
f6bce954 |
| 16-May-2024 |
Yang Wang <[email protected]> |
drm/amdgpu: change aca bank error lock type to spinlock
Change the lock type to 'spinlock' to avoid scheduling issues in interrupt context.
Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4 |
|
| #
f2355862 |
| 12-Apr-2024 |
Yang Wang <[email protected]> |
drm/amdgpu: add new aca smu callback func parse_error_code()
Add a new aca smu callback, parse_error_code(), to avoid ASIC-specific checks in the amdgpu_aca.c file.
Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.9-rc3, v6.9-rc2 |
|
| #
81d96e8b |
| 28-Mar-2024 |
Yang Wang <[email protected]> |
drm/amdgpu: refine function signature of amdgpu_aca_get_error_data()
refine function signature of amdgpu_aca_get_error_data();
Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.9-rc1 |
|
| #
31fd330b |
| 18-Mar-2024 |
Yang Wang <[email protected]> |
drm/amdgpu: add ras event id support for ACA
add ras event id support for ACA.
Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.8, v6.8-rc7 |
|
| #
bd15bf74 |
| 03-Mar-2024 |
Yang Wang <[email protected]> |
drm/amdgpu: avoid update aca bank multi times during ras isr
Because the UE Valid MCA count is only cleared after a reset, the aca bank is updated only once during the ras isr so that the same errors are not counted repeatedly.
Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.8-rc6 |
|
| #
e3d4de8d |
| 22-Feb-2024 |
Yang Wang <[email protected]> |
drm/amdgpu: retire unused aca_bank_report data structure
retire unused aca_bank_report data structure.
Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.8-rc5, v6.8-rc4 |
|
| #
949899cb |
| 06-Feb-2024 |
Yang Wang <[email protected]> |
drm/amdgpu: add new api to save error count into aca cache
add new api to save error count into aca cache.
Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.8-rc3 |
|
| #
abc3b5d2 |
| 31-Jan-2024 |
Yang Wang <[email protected]> |
drm/amdgpu: add new aca_smu_type support
Add new types to distinguish between ACA error type and smu mca type.
For example, ACA_ERROR_TYPE_DEFERRED does not match any smu mca valid bank channel, so add the new type 'aca_smu_type' to distinguish the aca error type from the smu mca type.
Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.8-rc2 |
|
| #
c0c48f0d |
| 24-Jan-2024 |
Yang Wang <[email protected]> |
drm/amdgpu: adjust aca init/fini sequence to match gpu reset
- move the aca init/fini functions into ras init/fini to adapt to the gpu reset sequence.
- add new function amdgpu_aca_reset()
Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.8-rc1, v6.7 |
|
| #
37973b69 |
| 02-Jan-2024 |
Yang Wang <[email protected]> |
drm/amdgpu: add aca sysfs support
add aca sysfs node support
Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3 |
|
| #
04c4fcd2 |
| 24-Nov-2023 |
Yang Wang <[email protected]> |
drm/amdgpu: add amdgpu ras aca query interface
v1: add ACA error query interface
v2: Add a new helper function to determine whether to use ACA or MCA.
Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
| #
33dcda51 |
| 24-Nov-2023 |
Yang Wang <[email protected]> |
drm/amdgpu: add ACA bank dump debugfs support
add ACA bank dump debugfs support
Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.7-rc2 |
|
| #
f5e4cc84 |
| 13-Nov-2023 |
Yang Wang <[email protected]> |
drm/amdgpu: implement RAS ACA driver framework
v1: implement new RAS ACA driver code framework.
v2:
- rename aca_bank_set to aca_banks.
- rename aca_source_xxx to aca_handle_xxx.
v3: Optimize some function implementation details. (from Hawking's suggestion)
Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|