History log of /linux-6.15/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c (Results 1 – 25 of 36)
Revision (<<< Hide revision tags) (Show revision tags >>>) Date Author Comments
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14
# aedc92be 24-Mar-2025 Xiang Liu <[email protected]>

drm/amdgpu: Parse all deferred errors with UMC aca handle

We should only increase the deferred errors in UMC block.

Signed-off-by: Xiang Liu <[email protected]>
Reviewed-by: Hawking Zhang <Hawking.

drm/amdgpu: Parse all deferred errors with UMC aca handle

We should only increase the deferred errors in UMC block.

Signed-off-by: Xiang Liu <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# 338f7412 19-Mar-2025 Xiang Liu <[email protected]>

drm/amdgpu: Decode deferred error type in gfx aca bank parser

In the case of injecting uncorrected error with background workload,
the deferred error among uncorrected errors need to be specified
by

drm/amdgpu: Decode deferred error type in gfx aca bank parser

In the case of injecting uncorrected error with background workload,
the deferred error among uncorrected errors need to be specified
by checking the deferred and poison bits of status register.

v2: refine checking for deferred error
v2: log possiable DEs among CEs
v2: generate CPER records for DEs among UEs

Signed-off-by: Xiang Liu <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.14-rc7, v6.14-rc6, v6.14-rc5
# ce615fe3 24-Feb-2025 Xiang Liu <[email protected]>

drm/amdgpu: Check if CPER enabled when generating CPER

In the case of CPER disabled, generating CPER will cause kernel NULL
pointer dereference without checking.

Signed-off-by: Xiang Liu <xiang.liu

drm/amdgpu: Check if CPER enabled when generating CPER

In the case of CPER disabled, generating CPER will cause kernel NULL
pointer dereference without checking.

Signed-off-by: Xiang Liu <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.14-rc4
# 2f94469c 19-Feb-2025 Xiang Liu <[email protected]>

drm/amdgpu: Remove redundant check of adev

There is no need to check adev for sure.

Signed-off-by: Xiang Liu <[email protected]>
Reviewed-by: Yang Wang <[email protected]>
Signed-off-by: Alex

drm/amdgpu: Remove redundant check of adev

There is no need to check adev for sure.

Signed-off-by: Xiang Liu <[email protected]>
Reviewed-by: Yang Wang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.14-rc3
# 652e0902 11-Feb-2025 Hawking Zhang <[email protected]>

drm/amdgpu: Generate cper records

Encode the error information in CPER format and commit
to the cper ring

Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Yang Wang <keivnyang.wang

drm/amdgpu: Generate cper records

Encode the error information in CPER format and commit
to the cper ring

Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Yang Wang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.14-rc2, v6.14-rc1
# ad97840f 26-Jan-2025 Hawking Zhang <[email protected]>

drm/amdgpu: Introduce funcs for generating cper record

Introduce new functions that are used to generate
cper ue or ce records.

v2: return -ENOMEM instead of false
v2: check return value of fill se

drm/amdgpu: Introduce funcs for generating cper record

Introduce new functions that are used to generate
cper ue or ce records.

v2: return -ENOMEM instead of false
v2: check return value of fill section function

Signed-off-by: Hawking Zhang <[email protected]>
Signed-off-by: Xiang Liu <[email protected]>
Reviewed-by: Yang Wang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# 56316ee9 26-Jan-2025 Hawking Zhang <[email protected]>

drm/amdgpu: Include ACA error type in aca bank

ACA error types managed by driver a direct 1:1
correspondence with those managed by firmware.

To address this, for each ACA bank, include
both the ACA

drm/amdgpu: Include ACA error type in aca bank

ACA error types managed by driver a direct 1:1
correspondence with those managed by firmware.

To address this, for each ACA bank, include
both the ACA error type and the ACA SMU type.

This addition is useful for creating CPER records.

Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Yang Wang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7
# 2bb7dced 06-Nov-2024 Yang Wang <[email protected]>

drm/amdgpu: fix ACA bank count boundary check error

fix ACA bank count boundary check error.

Fixes: f5e4cc8461c4 ("drm/amdgpu: implement RAS ACA driver framework")
Signed-off-by: Yang Wang <kevinya

drm/amdgpu: fix ACA bank count boundary check error

fix ACA bank count boundary check error.

Fixes: f5e4cc8461c4 ("drm/amdgpu: implement RAS ACA driver framework")
Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11
# 0110ac11 11-Sep-2024 Yan Zhen <[email protected]>

drm/amdgpu: fix typo in the comment

Correctly spelled comments make it easier for the reader to understand
the code.

Replace 'udpate' with 'update' in the comment &
replace 'recieved' with 'receive

drm/amdgpu: fix typo in the comment

Correctly spelled comments make it easier for the reader to understand
the code.

Replace 'udpate' with 'update' in the comment &
replace 'recieved' with 'received' in the comment &
replace 'dsiable' with 'disable' in the comment &
replace 'Initiailize' with 'Initialize' in the comment &
replace 'disble' with 'disable' in the comment &
replace 'Disbale' with 'Disable' in the comment &
replace 'enogh' with 'enough' in the comment &
replace 'availabe' with 'available' in the comment.

Acked-by: Christian König <[email protected]>
Signed-off-by: Yan Zhen <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.11-rc7, v6.11-rc6, v6.11-rc5
# 4416377a 21-Aug-2024 Yang Wang <[email protected]>

drm/amdgpu: add list empty check to avoid null pointer issue

Add list empty check to avoid null pointer issues in some corner cases.
- list_for_each_entry_safe()

Signed-off-by: Yang Wang <kevinyang

drm/amdgpu: add list empty check to avoid null pointer issue

Add list empty check to avoid null pointer issues in some corner cases.
- list_for_each_entry_safe()

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.11-rc4, v6.11-rc3, v6.11-rc2
# 671af066 02-Aug-2024 Yang Wang <[email protected]>

drm/amdgpu: remove RAS unused paramter 'err_addr'

- amdgpu_ras_error_statistic_ue_count()
- amdgpu_ras_error_statistic_ce_count()
- amdgpu_ras_error_statistic_de_count()

The parameter 'err_addr' is

drm/amdgpu: remove RAS unused paramter 'err_addr'

- amdgpu_ras_error_statistic_ue_count()
- amdgpu_ras_error_statistic_ce_count()
- amdgpu_ras_error_statistic_de_count()

The parameter 'err_addr' is no longer used since following patch.

Fixes: a7e8467fbeee ("drm/amdgpu: Remove unused code")
Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6
# 75ac6a25 25-Jun-2024 Yang Wang <[email protected]>

drm/amdgpu: refine amdgpu ras event id core code

v1:
- use unified event id to manage ras events
- add a new function amdgpu_ras_query_error_status_with_event() to accept
event type as parameter.

drm/amdgpu: refine amdgpu ras event id core code

v1:
- use unified event id to manage ras events
- add a new function amdgpu_ras_query_error_status_with_event() to accept
event type as parameter.

v2:
add a warn log to show the location of function failure
when calling amdgpu_ras_mark_event(). (Tao Zhou)

v3:
change RAS_EVENT_TYPE_ISR to RAS_EVENT_TYPE_FATAL.

v4:
rename amdgpu_ras_get_recovery_event() to
amdgpu_ras_get_fatal_error_event().

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.10-rc5
# a4fcb5f7 18-Jun-2024 Yang Wang <[email protected]>

Revert "drm/amdgpu: change aca bank error lock type to spinlock"

This reverts commit f6bce954f432c556659a57be9e18fecdc575affb.

Revert this patch to modify lock type back to 'mutex' to avoid kernel

Revert "drm/amdgpu: change aca bank error lock type to spinlock"

This reverts commit f6bce954f432c556659a57be9e18fecdc575affb.

Revert this patch to modify lock type back to 'mutex' to avoid kernel
calltrace issue.

[ 602.668806] Workqueue: amdgpu-reset-dev amdgpu_ras_do_recovery [amdgpu]
[ 602.668939] Call Trace:
[ 602.668940] <TASK>
[ 602.668941] dump_stack_lvl+0x4c/0x70
[ 602.668945] dump_stack+0x14/0x20
[ 602.668946] __schedule_bug+0x5a/0x70
[ 602.668950] __schedule+0x940/0xb30
[ 602.668952] ? srso_alias_return_thunk+0x5/0xfbef5
[ 602.668955] ? hrtimer_reprogram+0x77/0xb0
[ 602.668957] ? srso_alias_return_thunk+0x5/0xfbef5
[ 602.668959] ? hrtimer_start_range_ns+0x126/0x370
[ 602.668961] schedule+0x39/0xe0
[ 602.668962] schedule_hrtimeout_range_clock+0xb1/0x140
[ 602.668964] ? __pfx_hrtimer_wakeup+0x10/0x10
[ 602.668966] schedule_hrtimeout_range+0x17/0x20
[ 602.668967] usleep_range_state+0x69/0x90
[ 602.668970] psp_cmd_submit_buf+0x132/0x570 [amdgpu]
[ 602.669066] psp_ras_invoke+0x75/0x1a0 [amdgpu]
[ 602.669156] psp_ras_query_address+0x9c/0x120 [amdgpu]
[ 602.669245] umc_v12_0_update_ecc_status+0x16d/0x520 [amdgpu]
[ 602.669337] ? srso_alias_return_thunk+0x5/0xfbef5
[ 602.669339] ? stack_depot_save+0x12/0x20
[ 602.669342] ? srso_alias_return_thunk+0x5/0xfbef5
[ 602.669343] ? set_track_prepare+0x52/0x70
[ 602.669346] ? kmemleak_alloc+0x4f/0x90
[ 602.669348] ? __kmalloc_node+0x34b/0x450
[ 602.669352] amdgpu_umc_update_ecc_status+0x23/0x40 [amdgpu]
[ 602.669438] mca_umc_mca_get_err_count+0x85/0xc0 [amdgpu]
[ 602.669554] mca_smu_parse_mca_error_count+0x120/0x1d0 [amdgpu]
[ 602.669655] amdgpu_mca_dispatch_mca_set.part.0+0x141/0x250 [amdgpu]
[ 602.669743] ? kmemleak_free+0x36/0x60
[ 602.669745] ? kvfree+0x32/0x40
[ 602.669747] ? srso_alias_return_thunk+0x5/0xfbef5
[ 602.669749] ? kfree+0x15d/0x2a0
[ 602.669752] amdgpu_mca_smu_log_ras_error+0x1f6/0x210 [amdgpu]
[ 602.669839] amdgpu_ras_query_error_status_helper+0x2ad/0x390 [amdgpu]
[ 602.669924] ? srso_alias_return_thunk+0x5/0xfbef5
[ 602.669925] ? __call_rcu_common.constprop.0+0xa6/0x2b0
[ 602.669929] amdgpu_ras_query_error_status+0xf3/0x620 [amdgpu]
[ 602.670014] ? srso_alias_return_thunk+0x5/0xfbef5
[ 602.670017] amdgpu_ras_log_on_err_counter+0xe1/0x170 [amdgpu]
[ 602.670103] amdgpu_ras_do_recovery+0xd2/0x2c0 [amdgpu]
[ 602.670187] ? srso_alias_return_thunk+0x5/0

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: YiPeng Chai <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.10-rc4, v6.10-rc3
# 9817f061 04-Jun-2024 Yang Wang <[email protected]>

drm/amdgpu: move aca/mca init functions into ras_init() stage

adjust the function position to better match aca/mca fini code in ras_fini().

Signed-off-by: Yang Wang <[email protected]>
Reviewe

drm/amdgpu: move aca/mca init functions into ras_init() stage

adjust the function position to better match aca/mca fini code in ras_fini().

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.10-rc2, v6.10-rc1
# 062a7ce6 17-May-2024 Yang Wang <[email protected]>

drm/amdgpu: fix ACA no query result after gpu reset

fix ACA no query result after gpu reset.

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-b

drm/amdgpu: fix ACA no query result after gpu reset

fix ACA no query result after gpu reset.

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# f6bce954 16-May-2024 Yang Wang <[email protected]>

drm/amdgpu: change aca bank error lock type to spinlock

modify the lock type to 'spinlock' to avoid schedule issue
in interrupt context.

Signed-off-by: Yang Wang <[email protected]>
Reviewed-b

drm/amdgpu: change aca bank error lock type to spinlock

modify the lock type to 'spinlock' to avoid schedule issue
in interrupt context.

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# b2aa3d4b 14-May-2024 Yang Wang <[email protected]>

drm/amdgpu: add debug flag to enable RAS ACA

Use debug_mask=0x10 (BIT.4) param to help enable RAS ACA.
(RAS ACA is disabled by default.)

Signed-off-by: Yang Wang <[email protected]>
Reviewed-b

drm/amdgpu: add debug flag to enable RAS ACA

Use debug_mask=0x10 (BIT.4) param to help enable RAS ACA.
(RAS ACA is disabled by default.)

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.9
# e6ae021a 09-May-2024 Jesse Zhang <[email protected]>

drm/amdgpu: fix the warning bad bit shift operation for aca_error_type type

Filter invalid aca error types before performing a shift operation.

Signed-off-by: Jesse Zhang <[email protected]>
Revi

drm/amdgpu: fix the warning bad bit shift operation for aca_error_type type

Filter invalid aca error types before performing a shift operation.

Signed-off-by: Jesse Zhang <[email protected]>
Reviewed-by: Yang Wang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.9-rc7
# a6bcffa5 30-Apr-2024 Hawking Zhang <[email protected]>

drm/amdgpu: Add smu v13_0_14 ip block

Add smu v13_0_14 ip block support

Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Le Ma <[email protected]>
Signed-off-by: Alex Deucher <alexande

drm/amdgpu: Add smu v13_0_14 ip block

Add smu v13_0_14 ip block support

Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Le Ma <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.9-rc6, v6.9-rc5, v6.9-rc4
# f2355862 12-Apr-2024 Yang Wang <[email protected]>

drm/amdgpu: add new aca smu callback func parse_error_code()

add new aca smu callback parse_error_code{} to avoid specific asic check
in amdgpu_aca.c file

Signed-off-by: Yang Wang <kevinyang.wang@a

drm/amdgpu: add new aca smu callback func parse_error_code()

add new aca smu callback parse_error_code{} to avoid specific asic check
in amdgpu_aca.c file

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.9-rc3, v6.9-rc2
# 81d96e8b 28-Mar-2024 Yang Wang <[email protected]>

drm/amdgpu: refine function signature of amdgpu_aca_get_error_data()

refine function signature of amdgpu_aca_get_error_data();

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Tao Zho

drm/amdgpu: refine function signature of amdgpu_aca_get_error_data()

refine function signature of amdgpu_aca_get_error_data();

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.9-rc1
# 31fd330b 18-Mar-2024 Yang Wang <[email protected]>

drm/amdgpu: add ras event id support for ACA

add ras event id support for ACA.

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Reviewed-by: Tao

drm/amdgpu: add ras event id support for ACA

add ras event id support for ACA.

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.8, v6.8-rc7
# bd15bf74 03-Mar-2024 Yang Wang <[email protected]>

drm/amdgpu: avoid update aca bank multi times during ras isr

Because the UE Valid MCA count will only be cleared after reset,
in order to avoid repeated counting of the error count,
the aca bank is

drm/amdgpu: avoid update aca bank multi times during ras isr

Because the UE Valid MCA count will only be cleared after reset,
in order to avoid repeated counting of the error count,
the aca bank is only updated once during ras isr.

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.8-rc6
# 865d3397 21-Feb-2024 Yang Wang <[email protected]>

drm/amdgpu: add aca deferred error type support

add aca deferred error type support

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-b

drm/amdgpu: add aca deferred error type support

add aca deferred error type support

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# e3d4de8d 22-Feb-2024 Yang Wang <[email protected]>

drm/amdgpu: retire unused aca_bank_report data structure

retire unused aca_bank_report data structure.

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Hawking Zhang <Hawking.Zhang@am

drm/amdgpu: retire unused aca_bank_report data structure

retire unused aca_bank_report data structure.

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


12