|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14 |
|
| #
aedc92be |
| 24-Mar-2025 |
Xiang Liu <[email protected]> |
drm/amdgpu: Parse all deferred errors with UMC aca handle
We should only increase the deferred errors in UMC block.
Signed-off-by: Xiang Liu <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
| #
338f7412 |
| 19-Mar-2025 |
Xiang Liu <[email protected]> |
drm/amdgpu: Decode deferred error type in gfx aca bank parser
When an uncorrected error is injected while a background workload is running, the deferred errors among the uncorrected errors need to be identified by checking the deferred and poison bits of the status register.
v2: refine checking for deferred error
v2: log possible DEs among CEs
v2: generate CPER records for DEs among UEs
Signed-off-by: Xiang Liu <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
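The check described in this commit can be sketched roughly as follows. This is a minimal userspace illustration, not the driver's code: the bit positions (poison = 43, deferred = 44, UC = 61) are assumed from the AMD MCA status register layout, and the names `classify_status`, `status_bit` and `enum err_type` are hypothetical. It also assumes that either the deferred bit or the poison bit is enough to treat a bank as a deferred error.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions in the status register; verify against hardware docs. */
#define STATUS_BIT_POISON   43
#define STATUS_BIT_DEFERRED 44
#define STATUS_BIT_UC       61

/* Hypothetical error classes: corrected, deferred, uncorrected. */
enum err_type { ERR_TYPE_CE, ERR_TYPE_DE, ERR_TYPE_UE };

/* Return true when the given bit is set in the 64-bit status value. */
static bool status_bit(uint64_t status, unsigned int bit)
{
	return ((status >> bit) & 1ULL) != 0;
}

/*
 * Hypothetical helper: classify a bank from its status value.
 * Deferred/poison is checked first so that a deferred error reported
 * alongside the UC bit is not counted as a plain uncorrected error.
 */
static enum err_type classify_status(uint64_t status)
{
	if (status_bit(status, STATUS_BIT_DEFERRED) ||
	    status_bit(status, STATUS_BIT_POISON))
		return ERR_TYPE_DE;
	if (status_bit(status, STATUS_BIT_UC))
		return ERR_TYPE_UE;
	return ERR_TYPE_CE;
}
```

The ordering of the checks is the point of the sketch: testing the deferred and poison bits before the UC bit is what separates DEs from UEs when both are reported together.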
|
|
Revision tags: v6.14-rc7, v6.14-rc6, v6.14-rc5 |
|
| #
ce615fe3 |
| 24-Feb-2025 |
Xiang Liu <[email protected]> |
drm/amdgpu: Check if CPER enabled when generating CPER
In the case of CPER disabled, generating CPER will cause kernel NULL pointer dereference without checking.
Signed-off-by: Xiang Liu <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
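A guard of the kind this commit describes might look like the sketch below. The types `struct cper_state` and `struct cper_ring` and the function `cper_generate_record` are hypothetical stand-ins, not the driver's structures, and returning 0 when CPER is disabled is an assumption about the intended behaviour.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical ring that only counts committed records. */
struct cper_ring {
	unsigned int count;
};

/* Hypothetical CPER state: the ring is only allocated when enabled. */
struct cper_state {
	bool enabled;
	struct cper_ring *ring;
};

/*
 * Generate one record. When CPER is disabled the ring pointer may be
 * NULL, so bail out before touching it instead of dereferencing it.
 */
static int cper_generate_record(struct cper_state *cper)
{
	if (!cper->enabled || cper->ring == NULL)
		return 0; /* nothing to do; assumed not to be an error */

	cper->ring->count++;
	return 0;
}
```

Calling `cper_generate_record()` on a state with `enabled == false` and `ring == NULL` returns without touching the ring, which is the case the commit message says used to crash.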
|
|
Revision tags: v6.14-rc4 |
|
| #
2f94469c |
| 19-Feb-2025 |
Xiang Liu <[email protected]> |
drm/amdgpu: Remove redundant check of adev
There is no need to check adev for sure.
Signed-off-by: Xiang Liu <[email protected]> Reviewed-by: Yang Wang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.14-rc3 |
|
| #
652e0902 |
| 11-Feb-2025 |
Hawking Zhang <[email protected]> |
drm/amdgpu: Generate cper records
Encode the error information in CPER format and commit to the cper ring
Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.14-rc2, v6.14-rc1 |
|
| #
ad97840f |
| 26-Jan-2025 |
Hawking Zhang <[email protected]> |
drm/amdgpu: Introduce funcs for generating cper record
Introduce new functions that are used to generate cper ue or ce records.
v2: return -ENOMEM instead of false
v2: check return value of fill section function
Signed-off-by: Hawking Zhang <[email protected]> Signed-off-by: Xiang Liu <[email protected]> Reviewed-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
| #
56316ee9 |
| 26-Jan-2025 |
Hawking Zhang <[email protected]> |
drm/amdgpu: Include ACA error type in aca bank
ACA error types managed by the driver do not have a direct 1:1 correspondence with those managed by firmware.
To address this, for each ACA bank, include both the ACA error type and the ACA SMU type.
This addition is useful for creating CPER records.
Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7 |
|
| #
2bb7dced |
| 06-Nov-2024 |
Yang Wang <[email protected]> |
drm/amdgpu: fix ACA bank count boundary check error
fix ACA bank count boundary check error.
Fixes: f5e4cc8461c4 ("drm/amdgpu: implement RAS ACA driver framework") Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11 |
|
| #
0110ac11 |
| 11-Sep-2024 |
Yan Zhen <[email protected]> |
drm/amdgpu: fix typo in the comment
Correctly spelled comments make it easier for the reader to understand the code.
Replace 'udpate' with 'update' in the comment & replace 'recieved' with 'received' in the comment & replace 'dsiable' with 'disable' in the comment & replace 'Initiailize' with 'Initialize' in the comment & replace 'disble' with 'disable' in the comment & replace 'Disbale' with 'Disable' in the comment & replace 'enogh' with 'enough' in the comment & replace 'availabe' with 'available' in the comment.
Acked-by: Christian König <[email protected]> Signed-off-by: Yan Zhen <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.11-rc7, v6.11-rc6, v6.11-rc5 |
|
| #
4416377a |
| 21-Aug-2024 |
Yang Wang <[email protected]> |
drm/amdgpu: add list empty check to avoid null pointer issue
Add list empty check to avoid null pointer issues in some corner cases.
- list_for_each_entry_safe()
Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.11-rc4, v6.11-rc3, v6.11-rc2 |
|
| #
671af066 |
| 02-Aug-2024 |
Yang Wang <[email protected]> |
drm/amdgpu: remove RAS unused paramter 'err_addr'
- amdgpu_ras_error_statistic_ue_count()
- amdgpu_ras_error_statistic_ce_count()
- amdgpu_ras_error_statistic_de_count()
The parameter 'err_addr' is no longer used since following patch.
Fixes: a7e8467fbeee ("drm/amdgpu: Remove unused code") Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6 |
|
| #
75ac6a25 |
| 25-Jun-2024 |
Yang Wang <[email protected]> |
drm/amdgpu: refine amdgpu ras event id core code
v1:
- use unified event id to manage ras events
- add a new function amdgpu_ras_query_error_status_with_event() to accept event type as parameter.
v2: add a warn log to show the location of function failure when calling amdgpu_ras_mark_event(). (Tao Zhou)
v3: change RAS_EVENT_TYPE_ISR to RAS_EVENT_TYPE_FATAL.
v4: rename amdgpu_ras_get_recovery_event() to amdgpu_ras_get_fatal_error_event().
Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.10-rc5 |
|
| #
a4fcb5f7 |
| 18-Jun-2024 |
Yang Wang <[email protected]> |
Revert "drm/amdgpu: change aca bank error lock type to spinlock"
This reverts commit f6bce954f432c556659a57be9e18fecdc575affb.
Revert this patch to modify lock type back to 'mutex' to avoid kernel calltrace issue.
[ 602.668806] Workqueue: amdgpu-reset-dev amdgpu_ras_do_recovery [amdgpu]
[ 602.668939] Call Trace:
[ 602.668940] <TASK>
[ 602.668941] dump_stack_lvl+0x4c/0x70
[ 602.668945] dump_stack+0x14/0x20
[ 602.668946] __schedule_bug+0x5a/0x70
[ 602.668950] __schedule+0x940/0xb30
[ 602.668952] ? srso_alias_return_thunk+0x5/0xfbef5
[ 602.668955] ? hrtimer_reprogram+0x77/0xb0
[ 602.668957] ? srso_alias_return_thunk+0x5/0xfbef5
[ 602.668959] ? hrtimer_start_range_ns+0x126/0x370
[ 602.668961] schedule+0x39/0xe0
[ 602.668962] schedule_hrtimeout_range_clock+0xb1/0x140
[ 602.668964] ? __pfx_hrtimer_wakeup+0x10/0x10
[ 602.668966] schedule_hrtimeout_range+0x17/0x20
[ 602.668967] usleep_range_state+0x69/0x90
[ 602.668970] psp_cmd_submit_buf+0x132/0x570 [amdgpu]
[ 602.669066] psp_ras_invoke+0x75/0x1a0 [amdgpu]
[ 602.669156] psp_ras_query_address+0x9c/0x120 [amdgpu]
[ 602.669245] umc_v12_0_update_ecc_status+0x16d/0x520 [amdgpu]
[ 602.669337] ? srso_alias_return_thunk+0x5/0xfbef5
[ 602.669339] ? stack_depot_save+0x12/0x20
[ 602.669342] ? srso_alias_return_thunk+0x5/0xfbef5
[ 602.669343] ? set_track_prepare+0x52/0x70
[ 602.669346] ? kmemleak_alloc+0x4f/0x90
[ 602.669348] ? __kmalloc_node+0x34b/0x450
[ 602.669352] amdgpu_umc_update_ecc_status+0x23/0x40 [amdgpu]
[ 602.669438] mca_umc_mca_get_err_count+0x85/0xc0 [amdgpu]
[ 602.669554] mca_smu_parse_mca_error_count+0x120/0x1d0 [amdgpu]
[ 602.669655] amdgpu_mca_dispatch_mca_set.part.0+0x141/0x250 [amdgpu]
[ 602.669743] ? kmemleak_free+0x36/0x60
[ 602.669745] ? kvfree+0x32/0x40
[ 602.669747] ? srso_alias_return_thunk+0x5/0xfbef5
[ 602.669749] ? kfree+0x15d/0x2a0
[ 602.669752] amdgpu_mca_smu_log_ras_error+0x1f6/0x210 [amdgpu]
[ 602.669839] amdgpu_ras_query_error_status_helper+0x2ad/0x390 [amdgpu]
[ 602.669924] ? srso_alias_return_thunk+0x5/0xfbef5
[ 602.669925] ? __call_rcu_common.constprop.0+0xa6/0x2b0
[ 602.669929] amdgpu_ras_query_error_status+0xf3/0x620 [amdgpu]
[ 602.670014] ? srso_alias_return_thunk+0x5/0xfbef5
[ 602.670017] amdgpu_ras_log_on_err_counter+0xe1/0x170 [amdgpu]
[ 602.670103] amdgpu_ras_do_recovery+0xd2/0x2c0 [amdgpu]
[ 602.670187] ? srso_alias_return_thunk+0x5/0
Signed-off-by: Yang Wang <[email protected]> Reviewed-by: YiPeng Chai <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.10-rc4, v6.10-rc3 |
|
| #
9817f061 |
| 04-Jun-2024 |
Yang Wang <[email protected]> |
drm/amdgpu: move aca/mca init functions into ras_init() stage
adjust the function position to better match aca/mca fini code in ras_fini().
Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.10-rc2, v6.10-rc1 |
|
| #
062a7ce6 |
| 17-May-2024 |
Yang Wang <[email protected]> |
drm/amdgpu: fix ACA no query result after gpu reset
fix ACA no query result after gpu reset.
Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
| #
f6bce954 |
| 16-May-2024 |
Yang Wang <[email protected]> |
drm/amdgpu: change aca bank error lock type to spinlock
modify the lock type to 'spinlock' to avoid schedule issue in interrupt context.
Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
| #
b2aa3d4b |
| 14-May-2024 |
Yang Wang <[email protected]> |
drm/amdgpu: add debug flag to enable RAS ACA
Use debug_mask=0x10 (BIT.4) param to help enable RAS ACA. (RAS ACA is disabled by default.)
Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
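As a rough illustration of how a `debug_mask` value of 0x10 maps to bit 4, the test could be written as below. The macro name `DEBUG_ENABLE_RAS_ACA` and the function `ras_aca_enabled` are hypothetical; only the value 0x10 (bit 4) comes from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 4 of the debug mask (0x10); the macro name is hypothetical. */
#define DEBUG_ENABLE_RAS_ACA (1u << 4)

/* Return true when the RAS ACA debug bit is set in the mask. */
static bool ras_aca_enabled(uint32_t debug_mask)
{
	return (debug_mask & DEBUG_ENABLE_RAS_ACA) != 0;
}
```

With this sketch a mask of 0x10 or 0x1f enables the feature, while 0x0 or 0x0f leaves it in its default (disabled) state.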
|
|
Revision tags: v6.9 |
|
| #
e6ae021a |
| 09-May-2024 |
Jesse Zhang <[email protected]> |
drm/amdgpu: fix the warning bad bit shift operation for aca_error_type type
Filter invalid aca error types before performing a shift operation.
Signed-off-by: Jesse Zhang <[email protected]> Reviewed-by: Yang Wang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
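The idea of filtering invalid types before shifting can be sketched as follows. The enum below is a hypothetical stand-in for the driver's error type enum (it assumes an invalid value of -1 and a trailing count member), and `type_in_mask` is an illustrative name. Shifting by a negative or out-of-range amount is undefined behaviour in C, which is why the range check comes first.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical error type enum with an explicit invalid value. */
enum err_type_sketch {
	TYPE_INVALID = -1,
	TYPE_UE = 0,
	TYPE_CE,
	TYPE_DEFERRED,
	TYPE_COUNT
};

/*
 * Test whether 'type' is set in 'mask'. Out-of-range types are
 * rejected before the shift so that (1u << type) is always defined.
 */
static bool type_in_mask(uint32_t mask, int type)
{
	if (type < 0 || type >= TYPE_COUNT)
		return false;

	return (mask & (1u << type)) != 0;
}
```

Passing `TYPE_INVALID` or `TYPE_COUNT` returns false regardless of the mask, so no shift is ever performed with an invalid count.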
|
|
Revision tags: v6.9-rc7 |
|
| #
a6bcffa5 |
| 30-Apr-2024 |
Hawking Zhang <[email protected]> |
drm/amdgpu: Add smu v13_0_14 ip block
Add smu v13_0_14 ip block support
Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Le Ma <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.9-rc6, v6.9-rc5, v6.9-rc4 |
|
| #
f2355862 |
| 12-Apr-2024 |
Yang Wang <[email protected]> |
drm/amdgpu: add new aca smu callback func parse_error_code()
add new aca smu callback parse_error_code() to avoid ASIC-specific checks in the amdgpu_aca.c file
Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.9-rc3, v6.9-rc2 |
|
| #
81d96e8b |
| 28-Mar-2024 |
Yang Wang <[email protected]> |
drm/amdgpu: refine function signature of amdgpu_aca_get_error_data()
refine function signature of amdgpu_aca_get_error_data();
Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.9-rc1 |
|
| #
31fd330b |
| 18-Mar-2024 |
Yang Wang <[email protected]> |
drm/amdgpu: add ras event id support for ACA
add ras event id support for ACA.
Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.8, v6.8-rc7 |
|
| #
bd15bf74 |
| 03-Mar-2024 |
Yang Wang <[email protected]> |
drm/amdgpu: avoid update aca bank multi times during ras isr
Because the UE Valid MCA count will only be cleared after reset, in order to avoid repeated counting of the error count, the aca bank is only updated once during ras isr.
Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
|
Revision tags: v6.8-rc6 |
|
| #
865d3397 |
| 21-Feb-2024 |
Yang Wang <[email protected]> |
drm/amdgpu: add aca deferred error type support
add aca deferred error type support
Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|
| #
e3d4de8d |
| 22-Feb-2024 |
Yang Wang <[email protected]> |
drm/amdgpu: retire unused aca_bank_report data structure
retire unused aca_bank_report data structure.
Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
|