History log of /linux-6.15/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c (Results 1 – 25 of 66)
Revision (<<< Hide revision tags) (Show revision tags >>>) Date Author Comments
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6
# cbda2758 24-Jun-2024 Vignesh Chander <[email protected]>

drm/amdgpu: process RAS fatal error MB notification

For RAS error scenario, VF guest driver will check mailbox
and set fed flag to avoid unnecessary HW accesses.
additionally, poll for reset complet

drm/amdgpu: process RAS fatal error MB notification

For RAS error scenario, VF guest driver will check mailbox
and set fed flag to avoid unnecessary HW accesses.
additionally, poll for reset completion message first
to avoid accidentally spamming multiple reset requests to host.

v2: add another mailbox check for handling case where kfd detects
timeout first

v3: set host_flr bit and use wait_for_reset

Signed-off-by: Vignesh Chander <[email protected]>
Reviewed-by: Zhigang Luo <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.10-rc5
# 29b6985d 16-Jun-2024 Vignesh Chander <[email protected]>

drm/amdgpu: Use dev_ prints for virtualization as it supports multi adapter

So we can get clearer per device logging.

Signed-off-by: Vignesh Chander <[email protected]>
Reviewed-by: Zhigang L

drm/amdgpu: Use dev_ prints for virtualization as it supports multi adapter

So we can get clearer per device logging.

Signed-off-by: Vignesh Chander <[email protected]>
Reviewed-by: Zhigang Luo <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1
# 5c0a1cdd 24-May-2024 Yunxiang Li <[email protected]>

drm/amdgpu: fix sriov host flr handler

We send back the ready to reset message before we stop anything. This is
wrong. Move it to when we are actually ready for the FLR to happen.

In the current st

drm/amdgpu: fix sriov host flr handler

We send back the ready to reset message before we stop anything. This is
wrong. Move it to when we are actually ready for the FLR to happen.

In the current state since we take tens of seconds to stop everything,
it is very likely that host would give up waiting and reset the GPU
before we send ready, so it would be the same as before. But this gets
rid of the hack with reset_domain locking and also let us tell how slow
ready to reset actually is from the host. The ready to reset speed can
be improved later.

Signed-off-by: Yunxiang Li <[email protected]>
Acked-by: Christian König <[email protected]>
Reviewed-by: Emily Deng <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.9, v6.9-rc7, v6.9-rc6
# 25c01191 22-Apr-2024 Yunxiang Li <[email protected]>

drm/amdgpu: Add reset_context flag for host FLR

There are other reset sources that pass NULL as the job pointer, such as
amdgpu_amdkfd_reset_work. Therefore, using the job pointer to check if
the FL

drm/amdgpu: Add reset_context flag for host FLR

There are other reset sources that pass NULL as the job pointer, such as
amdgpu_amdkfd_reset_work. Therefore, using the job pointer to check if
the FLR comes from the host does not work.

Add a flag in reset_context to explicitly mark host triggered reset, and
set this flag when we receive host reset notification.

Signed-off-by: Yunxiang Li <[email protected]>
Reviewed-by: Emily Deng <[email protected]>
Reviewed-by: Zhigang Luo <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# f4322b9f 22-Apr-2024 Yunxiang Li <[email protected]>

drm/amdgpu: Fix two reset triggered in a row

Some times a hang GPU causes multiple reset sources to schedule resets.
The second source will be able to trigger an unnecessary reset if they
schedule a

drm/amdgpu: Fix two reset triggered in a row

Some times a hang GPU causes multiple reset sources to schedule resets.
The second source will be able to trigger an unnecessary reset if they
schedule after we call amdgpu_device_stop_pending_resets.

Move amdgpu_device_stop_pending_resets to after the reset is done. Since
at this point the GPU is supposedly in a good state, any reset scheduled
after this point would be a legitimate reset.

Remove unnecessary and incorrect checks for amdgpu_in_reset that was
kinda serving this purpose.

Signed-off-by: Yunxiang Li <[email protected]>
Reviewed-by: Lijo Lazar <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7
# ab66c832 29-Feb-2024 Zhigang Luo <[email protected]>

drm/amdgpu: trigger flr_work if reading pf2vf data failed

if reading pf2vf data failed 30 times continuously, it means something is
wrong. Need to trigger flr_work to recover the issue.

also use de

drm/amdgpu: trigger flr_work if reading pf2vf data failed

if reading pf2vf data failed 30 times continuously, it means something is
wrong. Need to trigger flr_work to recover the issue.

also use dev_err to print the error message to get which device has
issue and add warning message if waiting IDH_FLR_NOTIFICATION_CMPL
timeout.

Signed-off-by: Zhigang Luo <[email protected]>
Acked-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2
# ed1e1e42 23-Jan-2024 YiPeng Chai <[email protected]>

drm/amdgpu: Support passing poison consumption ras block to SRIOV

Support passing poison consumption ras blocks
to SRIOV.

Signed-off-by: YiPeng Chai <[email protected]>
Reviewed-by: Hawking Zhang

drm/amdgpu: Support passing poison consumption ras block to SRIOV

Support passing poison consumption ras blocks
to SRIOV.

Signed-off-by: YiPeng Chai <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7, v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7, v6.4-rc6, v6.4-rc5, v6.4-rc4, v6.4-rc3, v6.4-rc2, v6.4-rc1, v6.3, v6.3-rc7, v6.3-rc6, v6.3-rc5, v6.3-rc4, v6.3-rc3, v6.3-rc2, v6.3-rc1, v6.2, v6.2-rc8, v6.2-rc7, v6.2-rc6, v6.2-rc5, v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1, v6.1, v6.1-rc8, v6.1-rc7, v6.1-rc6, v6.1-rc5, v6.1-rc4, v6.1-rc3, v6.1-rc2, v6.1-rc1, v6.0, v6.0-rc7, v6.0-rc6, v6.0-rc5, v6.0-rc4, v6.0-rc3, v6.0-rc2, v6.0-rc1, v5.19
# 8ede944d 29-Jul-2022 Tao Zhou <[email protected]>

drm/amdgpu: add RAS poison consumption handler for AI SRIOV

Send message to host and host will handle it.

v2: split the patch into two parts, one is for mxgpu ai and another one
is for common poiso

drm/amdgpu: add RAS poison consumption handler for AI SRIOV

Send message to host and host will handle it.

v2: split the patch into two parts, one is for mxgpu ai and another one
is for common poison consumption handler.

Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# a340847b 13-Oct-2022 Victor Zhao <[email protected]>

Revert "drm/amdgpu: let mode2 reset fallback to default when failure"

This reverts commit dac6b80818ac2353631c5a33d140d8d5508e2957.

This commit reverted the AMDGPU_SKIP_MODE2_RESET as it conflicts

Revert "drm/amdgpu: let mode2 reset fallback to default when failure"

This reverts commit dac6b80818ac2353631c5a33d140d8d5508e2957.

This commit reverted the AMDGPU_SKIP_MODE2_RESET as it conflicts with
the original design of reset handler. Will redesign it.

Fixes: dac6b80818ac23 ("drm/amdgpu: let mode2 reset fallback to default when failure")
Signed-off-by: Victor Zhao <[email protected]>
Reviewed-by: Lijo Lazar <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# b98a1648 13-Oct-2022 Victor Zhao <[email protected]>

Revert "drm/amdgpu: let mode2 reset fallback to default when failure"

This reverts commit dac6b80818ac2353631c5a33d140d8d5508e2957.

This commit reverted the AMDGPU_SKIP_MODE2_RESET as it conflicts

Revert "drm/amdgpu: let mode2 reset fallback to default when failure"

This reverts commit dac6b80818ac2353631c5a33d140d8d5508e2957.

This commit reverted the AMDGPU_SKIP_MODE2_RESET as it conflicts with
the original design of reset handler. Will redesign it.

Fixes: dac6b80818ac23 ("drm/amdgpu: let mode2 reset fallback to default when failure")
Signed-off-by: Victor Zhao <[email protected]>
Reviewed-by: Lijo Lazar <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# dac6b808 28-Jul-2022 Victor Zhao <[email protected]>

drm/amdgpu: let mode2 reset fallback to default when failure

- introduce AMDGPU_SKIP_MODE2_RESET flag
- let mode2 reset fallback to default reset method if failed

v2: move this part out from the as

drm/amdgpu: let mode2 reset fallback to default when failure

- introduce AMDGPU_SKIP_MODE2_RESET flag
- let mode2 reset fallback to default reset method if failed

v2: move this part out from the asic specific part

Signed-off-by: Victor Zhao <[email protected]>
Acked-by: Andrey Grodzovsky <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v5.19-rc8, v5.19-rc7, v5.19-rc6
# f1549c09 08-Jul-2022 Likun Gao <[email protected]>

drm/amdgpu: support reset flag set for gpu reset

Move reset_context out of gpu recover function to make it configurable
for different reset purpose.
For the reset way of call gpu_recovery sysfs, for

drm/amdgpu: support reset flag set for gpu reset

Move reset_context out of gpu recover function to make it configurable
for different reset purpose.
For the reset way of call gpu_recovery sysfs, force to use full reset
method. Otherwise, try soft reset by default if the related ASIC
supportted, if soft reset failed, will use full reset.

Signed-off-by: Likun Gao <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v5.19-rc5, v5.19-rc4, v5.19-rc3, v5.19-rc2, v5.19-rc1, v5.18
# cf727044 17-May-2022 Andrey Grodzovsky <[email protected]>

drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to amdgpu_device_gpu_recover

We removed the wrapper that was queueing the recover function
into reset domain queue who was using this name.

Sig

drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to amdgpu_device_gpu_recover

We removed the wrapper that was queueing the recover function
into reset domain queue who was using this name.

Signed-off-by: Andrey Grodzovsky <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v5.18-rc7, v5.18-rc6, v5.18-rc5, v5.18-rc4, v5.18-rc3, v5.18-rc2, v5.18-rc1, v5.17, v5.17-rc8, v5.17-rc7, v5.17-rc6, v5.17-rc5, v5.17-rc4, v5.17-rc3, v5.17-rc2, v5.17-rc1
# 89a7a870 19-Jan-2022 Andrey Grodzovsky <[email protected]>

drm/amdgpu: Move in_gpu_reset into reset_domain

We should have a single instance per entrire reset domain.

Signed-off-by: Andrey Grodzovsky <[email protected]>
Suggested-by: Lijo Lazar <lij

drm/amdgpu: Move in_gpu_reset into reset_domain

We should have a single instance per entrire reset domain.

Signed-off-by: Andrey Grodzovsky <[email protected]>
Suggested-by: Lijo Lazar <[email protected]>
Reviewed-by: Christian König <[email protected]>
Link: https://www.spinics.net/lists/amd-gfx/msg74116.html

show more ...


# d0fb18b5 19-Jan-2022 Andrey Grodzovsky <[email protected]>

drm/amdgpu: Move reset sem into reset_domain

We want single instance of reset sem across all
reset clients because in case of XGMI we should stop
access cross device MMIO because any of them could b

drm/amdgpu: Move reset sem into reset_domain

We want single instance of reset sem across all
reset clients because in case of XGMI we should stop
access cross device MMIO because any of them could be
in a reset in the moment.

Signed-off-by: Andrey Grodzovsky <[email protected]>
Reviewed-by: Christian König <[email protected]>
Link: https://www.spinics.net/lists/amd-gfx/msg74117.html

show more ...


# cfbb6b00 21-Jan-2022 Andrey Grodzovsky <[email protected]>

drm/amdgpu: Rework reset domain to be refcounted.

The reset domain contains register access semaphor
now and so needs to be present as long as each device
in a hive needs it and so it cannot be bind

drm/amdgpu: Rework reset domain to be refcounted.

The reset domain contains register access semaphor
now and so needs to be present as long as each device
in a hive needs it and so it cannot be binded to XGMI
hive life cycle.
Adress this by making reset domain refcounted and pointed
by each member of the hive and the hive itself.

v4:

Fix crash on boot witrh XGMI hive by adding type to reset_domain.
XGMI will only create a new reset_domain if prevoius was of single
device type meaning it's first boot. Otherwsie it will take a
refocunt to exsiting reset_domain from the amdgou device.

Add a wrapper around reset_domain->refcount get/put
and a wrapper around send to reset wq (Lijo)

Signed-off-by: Andrey Grodzovsky <[email protected]>
Acked-by: Christian König <[email protected]>
Link: https://www.spinics.net/lists/amd-gfx/msg74121.html

show more ...


Revision tags: v5.16, v5.16-rc8, v5.16-rc7
# 02599bc7 20-Dec-2021 Andrey Grodzovsky <[email protected]>

drm/amd/virt: For SRIOV send GPU reset directly to TDR queue.

No need to to trigger another work queue inside the work queue.

v3:

Problem:
Extra reset caused by host side FLR notification
followin

drm/amd/virt: For SRIOV send GPU reset directly to TDR queue.

No need to to trigger another work queue inside the work queue.

v3:

Problem:
Extra reset caused by host side FLR notification
following guest side triggered reset.
Fix: Preven qeuing flr_work from mailbox irq if guest
already executing a reset.

Suggested-by: Liu Shaoyun <[email protected]>
Signed-off-by: Andrey Grodzovsky <[email protected]>
Reviewed-by: Liu Shaoyun <[email protected]>
Link: https://www.spinics.net/lists/amd-gfx/msg74114.html

show more ...


# 216a9873 29-Dec-2021 James Yao <[email protected]>

drm/amdgpu: add dummy event6 for vega10

[why]
Malicious mailbox event1 fails driver loading on vega10.
A dummy event6 prevent driver from taking response from malicious event1 as its own.

[how]
On

drm/amdgpu: add dummy event6 for vega10

[why]
Malicious mailbox event1 fails driver loading on vega10.
A dummy event6 prevent driver from taking response from malicious event1 as its own.

[how]
On vega10, send a mailbox event6 before sending event1.

Signed-off-by: James Yao <[email protected]>
Reviewed-by: Jingwen Chen <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v5.16-rc6
# fa4a427d 13-Dec-2021 Victor Skvortsov <[email protected]>

drm/amdgpu: SRIOV flr_work should use down_write

Host initiated VF FLR may fail if someone else is
already holding a read_lock. Change from down_write_trylock
to down_write to guarantee the reset go

drm/amdgpu: SRIOV flr_work should use down_write

Host initiated VF FLR may fail if someone else is
already holding a read_lock. Change from down_write_trylock
to down_write to guarantee the reset goes through.

Signed-off-by: Victor Skvortsov <[email protected]>
Reviewed by: Shaoyun.liu <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v5.16-rc5, v5.16-rc4, v5.16-rc3, v5.16-rc2, v5.16-rc1, v5.15, v5.15-rc7, v5.15-rc6, v5.15-rc5, v5.15-rc4, v5.15-rc3, v5.15-rc2, v5.15-rc1, v5.14
# 64261a0d 27-Aug-2021 YuBiao Wang <[email protected]>

drm/amd/amdgpu: Add ready_to_reset resp for vega10

Send response to host after received the flr notification from host.
Port NV change to vega10.

Signed-off-by: YuBiao Wang <[email protected]>
Re

drm/amd/amdgpu: Add ready_to_reset resp for vega10

Send response to host after received the flr notification from host.
Port NV change to vega10.

Signed-off-by: YuBiao Wang <[email protected]>
Reviewed-by: Jingwen Chen <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v5.14-rc7, v5.14-rc6, v5.14-rc5, v5.14-rc4, v5.14-rc3, v5.14-rc2, v5.14-rc1
# 798c5115 01-Jul-2021 Jingwen Chen <[email protected]>

drm/amdgpu: SRIOV flr_work should take write_lock

[Why]
If flr_work takes read_lock, then other threads who takes
read_lock can access hardware when host is doing vf flr.

[How]
flr_work should take

drm/amdgpu: SRIOV flr_work should take write_lock

[Why]
If flr_work takes read_lock, then other threads who takes
read_lock can access hardware when host is doing vf flr.

[How]
flr_work should take write_lock to avoid this case.

Signed-off-by: Jingwen Chen <[email protected]>
Reviewed-by: Monk Liu <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# b5840166 01-Jul-2021 Jingwen Chen <[email protected]>

drm/amdgpu: SRIOV flr_work should take write_lock

[Why]
If flr_work takes read_lock, then other threads who takes
read_lock can access hardware when host is doing vf flr.

[How]
flr_work should take

drm/amdgpu: SRIOV flr_work should take write_lock

[Why]
If flr_work takes read_lock, then other threads who takes
read_lock can access hardware when host is doing vf flr.

[How]
flr_work should take write_lock to avoid this case.

Signed-off-by: Jingwen Chen <[email protected]>
Reviewed-by: Monk Liu <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v5.13, v5.13-rc7, v5.13-rc6, v5.13-rc5, v5.13-rc4, v5.13-rc3, v5.13-rc2, v5.13-rc1, v5.12, v5.12-rc8, v5.12-rc7, v5.12-rc6, v5.12-rc5, v5.12-rc4, v5.12-rc3, v5.12-rc2, v5.12-rc1, v5.12-rc1-dontuse, v5.11, v5.11-rc7, v5.11-rc6, v5.11-rc5, v5.11-rc4, v5.11-rc3
# 3c2a01cb 07-Jan-2021 Jack Zhang <[email protected]>

drm/amdgpu/sriov Stop data exchange for wholegpu reset

[Why]
When host trigger a whole gpu reset, guest will keep
waiting till host finish reset. But there's a work
queue in guest exchanging data be

drm/amdgpu/sriov Stop data exchange for wholegpu reset

[Why]
When host trigger a whole gpu reset, guest will keep
waiting till host finish reset. But there's a work
queue in guest exchanging data between vf&pf which need
to access frame buffer. During whole gpu reset, frame
buffer is not accessable, and this causes the call trace.

[How]
After vf get reset notification from pf, stop data exchange.

Signed-off-by: Jingwen Chen <[email protected]>
Signed-off-by: Jack Zhang <[email protected]>
Reviewed-by: Monk Liu <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v5.11-rc2, v5.11-rc1, v5.10, v5.10-rc7, v5.10-rc6
# 3aa883ac 25-Nov-2020 Jiange Zhao <[email protected]>

drm/amdgpu/SRIOV: Extend VF reset request wait period

In Virtualization case, when one VF is sending too many
FLR requests, hypervisor would stop responding to this
VF's request for a long period of

drm/amdgpu/SRIOV: Extend VF reset request wait period

In Virtualization case, when one VF is sending too many
FLR requests, hypervisor would stop responding to this
VF's request for a long period of time. This is called
event guard. During this period of cooling time, guest
driver should wait instead of doing other things. After
this period of time, guest driver would resume reset
process and return to normal.

Currently, guest driver would wait 12 seconds and return fail
if it doesn't get response from host.

Solution: extend this waiting time in guest driver and poll
response periodically. Poll happens every 6 seconds and it will
last for 60 seconds.

v2: change the max repetition times from number to macro.

Signed-off-by: Jiange Zhao <[email protected]>
Acked-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v5.10-rc5, v5.10-rc4, v5.10-rc3, v5.10-rc2, v5.10-rc1, v5.9, v5.9-rc8, v5.9-rc7, v5.9-rc6, v5.9-rc5
# 2a9787dc 09-Sep-2020 Liu ChengZhe <[email protected]>

drm/amdgpu: Do gpu recovery when no job is running

In function flr_work, we should do gpu recovery when no job
is running. Fix the logic by inverting it.

v2: modify the description

Reviewed-by: Ch

drm/amdgpu: Do gpu recovery when no job is running

In function flr_work, we should do gpu recovery when no job
is running. Fix the logic by inverting it.

v2: modify the description

Reviewed-by: Christian König <[email protected]>
Signed-off-by: Liu ChengZhe <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


123