History log of /linux-6.15/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c (Results 1 – 25 of 33)
Revision (<<< Hide revision tags) (Show revision tags >>>) Date Author Comments
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5
# 7daa0f6b 25-Oct-2024 Yang Wang <[email protected]>

drm/amdgpu: optimize ACA log print

- skip to print CE ACA log.
- optimize ACA log print for MCA.

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>

drm/amdgpu: optimize ACA log print

- skip to print CE ACA log.
- optimize ACA log print for MCA.

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2
# 671af066 02-Aug-2024 Yang Wang <[email protected]>

drm/amdgpu: remove RAS unused paramter 'err_addr'

- amdgpu_ras_error_statistic_ue_count()
- amdgpu_ras_error_statistic_ce_count()
- amdgpu_ras_error_statistic_de_count()

The parameter 'err_addr' is

drm/amdgpu: remove RAS unused paramter 'err_addr'

- amdgpu_ras_error_statistic_ue_count()
- amdgpu_ras_error_statistic_ce_count()
- amdgpu_ras_error_statistic_de_count()

The parameter 'err_addr' is no longer used since following patch.

Fixes: a7e8467fbeee ("drm/amdgpu: Remove unused code")
Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6
# 75ac6a25 25-Jun-2024 Yang Wang <[email protected]>

drm/amdgpu: refine amdgpu ras event id core code

v1:
- use unified event id to manage ras events
- add a new function amdgpu_ras_query_error_status_with_event() to accept
event type as parameter.

drm/amdgpu: refine amdgpu ras event id core code

v1:
- use unified event id to manage ras events
- add a new function amdgpu_ras_query_error_status_with_event() to accept
event type as parameter.

v2:
add a warn log to show the location of function failure
when calling amdgpu_ras_mark_event(). (Tao Zhou)

v3:
change RAS_EVENT_TYPE_ISR to RAS_EVENT_TYPE_FATAL.

v4:
rename amdgpu_ras_get_recovery_event() to
amdgpu_ras_get_fatal_error_event().

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.10-rc5
# 8c9ee180 18-Jun-2024 Yang Wang <[email protected]>

Revert "drm/amdgpu: change bank cache lock type to spinlock"

This reverts commit 258ed689bc3163f86204f75df6c23f92b59b3fad

revert this patch to modify lock type back to 'mutex' to avoid kernel
callt

Revert "drm/amdgpu: change bank cache lock type to spinlock"

This reverts commit 258ed689bc3163f86204f75df6c23f92b59b3fad

revert this patch to modify lock type back to 'mutex' to avoid kernel
calltrace issue.

[ 602.668806] Workqueue: amdgpu-reset-dev amdgpu_ras_do_recovery [amdgpu]
[ 602.668939] Call Trace:
[ 602.668940] <TASK>
[ 602.668941] dump_stack_lvl+0x4c/0x70
[ 602.668945] dump_stack+0x14/0x20
[ 602.668946] __schedule_bug+0x5a/0x70
[ 602.668950] __schedule+0x940/0xb30
[ 602.668952] ? srso_alias_return_thunk+0x5/0xfbef5
[ 602.668955] ? hrtimer_reprogram+0x77/0xb0
[ 602.668957] ? srso_alias_return_thunk+0x5/0xfbef5
[ 602.668959] ? hrtimer_start_range_ns+0x126/0x370
[ 602.668961] schedule+0x39/0xe0
[ 602.668962] schedule_hrtimeout_range_clock+0xb1/0x140
[ 602.668964] ? __pfx_hrtimer_wakeup+0x10/0x10
[ 602.668966] schedule_hrtimeout_range+0x17/0x20
[ 602.668967] usleep_range_state+0x69/0x90
[ 602.668970] psp_cmd_submit_buf+0x132/0x570 [amdgpu]
[ 602.669066] psp_ras_invoke+0x75/0x1a0 [amdgpu]
[ 602.669156] psp_ras_query_address+0x9c/0x120 [amdgpu]
[ 602.669245] umc_v12_0_update_ecc_status+0x16d/0x520 [amdgpu]
[ 602.669337] ? srso_alias_return_thunk+0x5/0xfbef5
[ 602.669339] ? stack_depot_save+0x12/0x20
[ 602.669342] ? srso_alias_return_thunk+0x5/0xfbef5
[ 602.669343] ? set_track_prepare+0x52/0x70
[ 602.669346] ? kmemleak_alloc+0x4f/0x90
[ 602.669348] ? __kmalloc_node+0x34b/0x450
[ 602.669352] amdgpu_umc_update_ecc_status+0x23/0x40 [amdgpu]
[ 602.669438] mca_umc_mca_get_err_count+0x85/0xc0 [amdgpu]
[ 602.669554] mca_smu_parse_mca_error_count+0x120/0x1d0 [amdgpu]
[ 602.669655] amdgpu_mca_dispatch_mca_set.part.0+0x141/0x250 [amdgpu]
[ 602.669743] ? kmemleak_free+0x36/0x60
[ 602.669745] ? kvfree+0x32/0x40
[ 602.669747] ? srso_alias_return_thunk+0x5/0xfbef5
[ 602.669749] ? kfree+0x15d/0x2a0
[ 602.669752] amdgpu_mca_smu_log_ras_error+0x1f6/0x210 [amdgpu]
[ 602.669839] amdgpu_ras_query_error_status_helper+0x2ad/0x390 [amdgpu]
[ 602.669924] ? srso_alias_return_thunk+0x5/0xfbef5
[ 602.669925] ? __call_rcu_common.constprop.0+0xa6/0x2b0
[ 602.669929] amdgpu_ras_query_error_status+0xf3/0x620 [amdgpu]
[ 602.670014] ? srso_alias_return_thunk+0x5/0xfbef5
[ 602.670017] amdgpu_ras_log_on_err_counter+0xe1/0x170 [amdgpu]
[ 602.670103] amdgpu_ras_do_recovery+0xd2/0x2c0 [amdgpu]
[ 602.670187] ? srso_alias_return_thunk+0x5/0xfbef5
[ 602.670189] ? __schedule+0x37d/0xb30
[ 602.670191] process_one_work+0x176/0x350
[ 602.670194] worker_thread+0x2f7/0x420
[ 602.670197] ?

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: YiPeng Chai <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.10-rc4, v6.10-rc3
# 9817f061 04-Jun-2024 Yang Wang <[email protected]>

drm/amdgpu: move aca/mca init functions into ras_init() stage

adjust the function position to better match aca/mca fini code in ras_fini().

Signed-off-by: Yang Wang <[email protected]>
Reviewe

drm/amdgpu: move aca/mca init functions into ras_init() stage

adjust the function position to better match aca/mca fini code in ras_fini().

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.10-rc2, v6.10-rc1
# 258ed689 16-May-2024 Yang Wang <[email protected]>

drm/amdgpu: change bank cache lock type to spinlock

modify the lock type to 'spinlock' to avoid schedule issue
in interrupt context.

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: T

drm/amdgpu: change bank cache lock type to spinlock

modify the lock type to 'spinlock' to avoid schedule issue
in interrupt context.

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.9
# 85a24a3e 06-May-2024 Yang Wang <[email protected]>

drm/amdgpu: ignoring unsupported ras blocks when MCA bank dispatches

This patch is used to solve the problem of incorrect parsing of error counts.
When the UE trigger gpu is reset, the driver will a

drm/amdgpu: ignoring unsupported ras blocks when MCA bank dispatches

This patch is used to solve the problem of incorrect parsing of error counts.
When the UE trigger gpu is reset, the driver will attempt to parse all possible ras blocks.
For ras blocks that are not supported by the current ASIC, the driver should ignore this error.

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Candice Li <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.9-rc7
# a6bcffa5 30-Apr-2024 Hawking Zhang <[email protected]>

drm/amdgpu: Add smu v13_0_14 ip block

Add smu v13_0_14 ip block support

Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Le Ma <[email protected]>
Signed-off-by: Alex Deucher <alexande

drm/amdgpu: Add smu v13_0_14 ip block

Add smu v13_0_14 ip block support

Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Le Ma <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.9-rc6
# 5eccab32 23-Apr-2024 Yang Wang <[email protected]>

drm/amdgpu: avoid dump mca bank log muti times during ras ISR

because the ue valid mca count will only be cleared after gpu reset,
so only dump mca log on the first time to get mca bank after receiv

drm/amdgpu: avoid dump mca bank log muti times during ras ISR

because the ue valid mca count will only be cleared after gpu reset,
so only dump mca log on the first time to get mca bank after receive RAS interrupt.

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.9-rc5
# 76ad30f5 18-Apr-2024 Yang Wang <[email protected]>

drm/amdgpu: add MCA smu cache support

v1:
because SMU CE valid mca bank will be cleared after reading,
this patch adds mca cache at the driver level to ensure that the mca bank is not lost.

v2:
ref

drm/amdgpu: add MCA smu cache support

v1:
because SMU CE valid mca bank will be cleared after reading,
this patch adds mca cache at the driver level to ensure that the mca bank is not lost.

v2:
refine amdgpu_mca_init/fini/reset() function name.

v3:
add mca_cache.lock support
only add CE bank to mca bank cache.

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# 8fb20d95 18-Apr-2024 Yang Wang <[email protected]>

drm/amdgpu: add amdgpu MCA bank dispatch function support

- Refine mca driver code.
- Centralize mca bank dispatch code logic.

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Tao Zho

drm/amdgpu: add amdgpu MCA bank dispatch function support

- Refine mca driver code.
- Centralize mca bank dispatch code logic.

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# 7e0357be 18-Apr-2024 Yang Wang <[email protected]>

drm/amdgpu: remove unused MCA driver codes

- remove unused callback functions.
- make part of mca functions static and refine the function order.

Signed-off-by: Yang Wang <[email protected]>
R

drm/amdgpu: remove unused MCA driver codes

- remove unused callback functions.
- make part of mca functions static and refine the function order.

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1
# 9dc57c2a 13-Mar-2024 Yang Wang <[email protected]>

drm/amdgpu: add ras event id support

add amdgpu ras event id support to better distinguish different
error information sources in dmesg logs.

the following log will be identify by event id:
{event_

drm/amdgpu: add ras event id support

add amdgpu ras event id support to better distinguish different
error information sources in dmesg logs.

the following log will be identify by event id:
{event_id} interrupt to inform RAS event
{event_id} ACA logs
{event_id} errors statistic since from current injection/error query
{event_id} errors statistic since from gpu load

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3
# 788686e2 29-Jan-2024 Yang Wang <[email protected]>

drm/amdgpu: use helper macro HW_ERR instead of Hardware error string

use helper macro HW_ERR to instead of Hardware error string.

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Hawk

drm/amdgpu: use helper macro HW_ERR instead of Hardware error string

use helper macro HW_ERR to instead of Hardware error string.

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.8-rc2, v6.8-rc1
# afb617f3 15-Jan-2024 YiPeng Chai <[email protected]>

drm/amdgpu: add interface to check mca umc status

Add interface to check mca umc status.

Signed-off-by: YiPeng Chai <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-o

drm/amdgpu: add interface to check mca umc status

Add interface to check mca umc status.

Signed-off-by: YiPeng Chai <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.7
# 46e2231c 03-Jan-2024 Candice Li <[email protected]>

drm/amdgpu: Log deferred error separately

Separate deferred error from UE and CE and log it
individually.

Signed-off-by: Candice Li <[email protected]>
Reviewed-by: Hawking Zhang <Hawking.Zhang@am

drm/amdgpu: Log deferred error separately

Separate deferred error from UE and CE and log it
individually.

Signed-off-by: Candice Li <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# 4f32504a 03-Jan-2024 Srinivasan Shanmugam <[email protected]>

drm/amdgpu: Fix variable 'mca_funcs' dereferenced before NULL check in 'amdgpu_mca_smu_get_mca_entry()'

Fixes the below:

drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c:377 amdgpu_mca_smu_get_mca_entry() w

drm/amdgpu: Fix variable 'mca_funcs' dereferenced before NULL check in 'amdgpu_mca_smu_get_mca_entry()'

Fixes the below:

drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c:377 amdgpu_mca_smu_get_mca_entry() warn: variable dereferenced before check 'mca_funcs' (see line 368)

357 int amdgpu_mca_smu_get_mca_entry(struct amdgpu_device *adev,
enum amdgpu_mca_error_type type,
358 int idx, struct mca_bank_entry *entry)
359 {
360 const struct amdgpu_mca_smu_funcs *mca_funcs =
adev->mca.mca_funcs;
361 int count;
362
363 switch (type) {
364 case AMDGPU_MCA_ERROR_TYPE_UE:
365 count = mca_funcs->max_ue_count;

mca_funcs is dereferenced here.

366 break;
367 case AMDGPU_MCA_ERROR_TYPE_CE:
368 count = mca_funcs->max_ce_count;

mca_funcs is dereferenced here.

369 break;
370 default:
371 return -EINVAL;
372 }
373
374 if (idx >= count)
375 return -EINVAL;
376
377 if (mca_funcs && mca_funcs->mca_get_mca_entry)
^^^^^^^^^

Checked too late!

Cc: Yang Wang <[email protected]>
Cc: Hawking Zhang <[email protected]>
Cc: Christian König <[email protected]>
Cc: Alex Deucher <[email protected]>
Signed-off-by: Srinivasan Shanmugam <[email protected]>
Reviewed-by: Yang Wang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.7-rc8, v6.7-rc7, v6.7-rc6
# 9f91e983 12-Dec-2023 YiPeng Chai <[email protected]>

drm/amdgpu: MCA supports recording umc address information

MCA supports recording umc address information.

V2:
Move err_addr variable from struct ras_err_node to
struct ras_err_info.

Signed-off-

drm/amdgpu: MCA supports recording umc address information

MCA supports recording umc address information.

V2:
Move err_addr variable from struct ras_err_node to
struct ras_err_info.

Signed-off-by: YiPeng Chai <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.7-rc5, v6.7-rc4
# 00f9d49b 30-Nov-2023 Yang Wang <[email protected]>

drm/amdgpu: Fix missing mca debugfs node

Use amdgpu_ip_version() helper function to check ip version.

The ip version contains other information,
use the helper function to avoid reading wrong value

drm/amdgpu: Fix missing mca debugfs node

Use amdgpu_ip_version() helper function to check ip version.

The ip version contains other information,
use the helper function to avoid reading wrong value.

Reviewed-by: Lijo Lazar <[email protected]>
Signed-off-by: Yang Wang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.7-rc3, v6.7-rc2, v6.7-rc1
# 201761b5 09-Nov-2023 Lijo Lazar <[email protected]>

drm/amdgpu: Move mca debug mode decision to ras

Refactor code such that ras block decides the default mca debug mode,
and not swsmu block.

By default mca debug mode is set to false.

v2: squash in

drm/amdgpu: Move mca debug mode decision to ras

Refactor code such that ras block decides the default mca debug mode,
and not swsmu block.

By default mca debug mode is set to false.

v2: squash in uninitialized value fix (Alex)

Signed-off-by: Lijo Lazar <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# 8140b07b 08-Nov-2023 Yang Wang <[email protected]>

drm/amdgpu: correct mca debugfs dump reg list

avoid driver to touch invalid mca reg.

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-

drm/amdgpu: correct mca debugfs dump reg list

avoid driver to touch invalid mca reg.

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# d406aec8 08-Nov-2023 Hawking Zhang <[email protected]>

drm/amdgpu: correct acclerator check architecutre dump

So driver doesn't touch invalid aca entries.

Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Yang Wang <[email protected]

drm/amdgpu: correct acclerator check architecutre dump

So driver doesn't touch invalid aca entries.

Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Yang Wang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# 07c1db70 02-Nov-2023 Yang Wang <[email protected]>

drm/amdgpu: refine smu v13.0.6 mca dump driver

refine smu mca driver to support query ras error from pmfw path.
- correct gfx smu bank hwid (from mp5 to smu bank)
- retire unused callback function i

drm/amdgpu: refine smu v13.0.6 mca dump driver

refine smu mca driver to support query ras error from pmfw path.
- correct gfx smu bank hwid (from mp5 to smu bank)
- retire unused callback function in amdgpu_mca_smu_funcs{}
- add new mca_bank_set{} structure to collect mca bank
- move enum mca_reg_idx into amdgpu_mca.h header
- add mca status register field decode macro

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.6, v6.6-rc7, v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2, v6.6-rc1
# 4051844c 06-Sep-2023 Yang Wang <[email protected]>

drm/amdgpu: add amdgpu mca debug sysfs support

add amdgpu mca debug sysfs support.

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by

drm/amdgpu: add amdgpu mca debug sysfs support

add amdgpu mca debug sysfs support.

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# 7ff607e2 05-Sep-2023 Yang Wang <[email protected]>

drm/amdgpu: add amdgpu smu mca dump feature support

add amdgpu smu mca dump feature support.

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Sig

drm/amdgpu: add amdgpu smu mca dump feature support

add amdgpu smu mca dump feature support.

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


12