History log of /linux-6.15/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h (Results 1 – 25 of 42)
Revision (<<< Hide revision tags) (Show revision tags >>>) Date Author Comments
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2
# 485993e2 06-Feb-2025 Lijo Lazar <[email protected]>

drm/amdgpu: Add xgmi speed/width related info

Add APIs to initialize XGMI speed, width details and get to max
bandwidth supported. It is assumed that a device only supports same
generation of XGMI l

drm/amdgpu: Add xgmi speed/width related info

Add APIs to initialize XGMI speed, width details and get to max
bandwidth supported. It is assumed that a device only supports same
generation of XGMI links with uniform width.

Signed-off-by: Lijo Lazar <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# 6f16d101 06-Feb-2025 Lijo Lazar <[email protected]>

drm/amdgpu: Move xgmi definitions to xgmi header

Move definitions related to xgmi to amdgpu_xgmi header

Signed-off-by: Lijo Lazar <[email protected]>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.

drm/amdgpu: Move xgmi definitions to xgmi header

Move definitions related to xgmi to amdgpu_xgmi header

Signed-off-by: Lijo Lazar <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# 9424a5bf 10-Feb-2025 Jonathan Kim <[email protected]>

drm/amdgpu: simplify xgmi peer info calls

Deprecate KFD XGMI peer info calls in favour of calling directly from
simplified XGMI peer info functions.

Signed-off-by: Jonathan Kim <[email protected]

drm/amdgpu: simplify xgmi peer info calls

Deprecate KFD XGMI peer info calls in favour of calling directly from
simplified XGMI peer info functions.

Signed-off-by: Jonathan Kim <[email protected]>
Reviewed-by: Lijo Lazar <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12
# 466a59ab 11-Nov-2024 Asad Kamal <[email protected]>

drm/amd/pm: Get xgmi link status for XGMI_v_6_4_0

Get XGMI_v_6_4_0 link status and populate it to metrics v1_7 for
SMU_v_13_0_6

v2: Get link status register value for each soc from separate
functio

drm/amd/pm: Get xgmi link status for XGMI_v_6_4_0

Get XGMI_v_6_4_0 link status and populate it to metrics v1_7 for
SMU_v_13_0_6

v2: Get link status register value for each soc from separate
function (Lijo)

Signed-off-by: Asad Kamal <[email protected]>
Reviewed-by: Lijo Lazar <[email protected]>
Reviewed-by: Yang Wang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1
# e46738a5 20-Sep-2024 Jonathan Kim <[email protected]>

drm/amdkfd: sever xgmi io link if host driver has disable sharing

Host drivers can create partial hives per guest by disabling xgmi sharing
between certain peers in the main hive.
Typically, these p

drm/amdkfd: sever xgmi io link if host driver has disable sharing

Host drivers can create partial hives per guest by disabling xgmi sharing
between certain peers in the main hive.
Typically, these partial hives are fully connected per guest session.
In the event that the host makes a mistake by adding a non-shared node
to a guest session, have the KFD reflect sharing disabled by severing
the IO link.

Signed-off-by: Jonathan Kim <[email protected]>
Tested-by: James Yao <[email protected]>
Reviewed-by: Harish Kasiviswanathan <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# ee52489d 20-Sep-2024 Lijo Lazar <[email protected]>

drm/amdgpu: Place NPS mode request on unload

If a user has requested NPS mode switch, place the request through PSP
during unload of the driver. For devices which are part of a hive, all
requests ar

drm/amdgpu: Place NPS mode request on unload

If a user has requested NPS mode switch, place the request through PSP
during unload of the driver. For devices which are part of a hive, all
requests are placed together. If one of them fails, revert back to the
current NPS mode.

Signed-off-by: Lijo Lazar <[email protected]>
Signed-off-by: Rajneesh Bhardwaj <[email protected]>
Reviewed-by: Feifei Xu <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# bbc16008 19-Sep-2024 Lijo Lazar <[email protected]>

drm/amdgpu: Add gmc interface to request NPS mode

Add a common interface in GMC to request NPS mode through PSP. Also add
a variable in hive and gmc control to track the last requested mode.

Signed

drm/amdgpu: Add gmc interface to request NPS mode

Add a common interface in GMC to request NPS mode through PSP. Also add
a variable in hive and gmc control to track the last requested mode.

Signed-off-by: Rajneesh Bhardwaj <[email protected]>
Signed-off-by: Lijo Lazar <[email protected]>
Reviewed-by: Feifei Xu <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.11, v6.11-rc7, v6.11-rc6
# 631af731 26-Aug-2024 Lijo Lazar <[email protected]>

drm/amdgpu: Refactor XGMI reset on init handling

Use XGMI hive information to rely on resetting XGMI devices on
initialization rather than using mgpu structure. mgpu structure may have
other devices

drm/amdgpu: Refactor XGMI reset on init handling

Use XGMI hive information to rely on resetting XGMI devices on
initialization rather than using mgpu structure. mgpu structure may have
other devices as well.

Signed-off-by: Lijo Lazar <[email protected]>
Reviewed-by: Feifei Xu <[email protected]>
Acked-by: Rajneesh Bhardwaj <[email protected]>
Tested-by: Rajneesh Bhardwaj <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1
# 9dc57c2a 13-Mar-2024 Yang Wang <[email protected]>

drm/amdgpu: add ras event id support

add amdgpu ras event id support to better distinguish different
error information sources in dmesg logs.

the following log will be identify by event id:
{event_

drm/amdgpu: add ras event id support

add amdgpu ras event id support to better distinguish different
error information sources in dmesg logs.

the following log will be identify by event id:
{event_id} interrupt to inform RAS event
{event_id} ACA logs
{event_id} errors statistic since from current injection/error query
{event_id} errors statistic since from gpu load

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1
# fb1c93c2 10-Jan-2024 Christian König <[email protected]>

drm/amdgpu: revert "Adjust removal control flow for smu v13_0_2"

Calling amdgpu_device_ip_resume_phase1() during shutdown leaves the
HW in an active state and is an unbalanced use of the IP callback

drm/amdgpu: revert "Adjust removal control flow for smu v13_0_2"

Calling amdgpu_device_ip_resume_phase1() during shutdown leaves the
HW in an active state and is an unbalanced use of the IP callbacks.

Using the IP callbacks like this can lead to memory leaks, double
free and imbalanced reference counters.

Leaving the HW in an active state can lead to DMA accesses to memory now
freed by the driver.

Both is a complete no-go for driver unload so completely revert the
workaround for now.

This reverts commit f5c7e7797060255dbc8160734ccc5ad6183c5e04.

Signed-off-by: Christian König <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Cc: [email protected]

show more ...


# 087a3e13 10-Jan-2024 Christian König <[email protected]>

drm/amdgpu: revert "Adjust removal control flow for smu v13_0_2"

Calling amdgpu_device_ip_resume_phase1() during shutdown leaves the
HW in an active state and is an unbalanced use of the IP callback

drm/amdgpu: revert "Adjust removal control flow for smu v13_0_2"

Calling amdgpu_device_ip_resume_phase1() during shutdown leaves the
HW in an active state and is an unbalanced use of the IP callbacks.

Using the IP callbacks like this can lead to memory leaks, double
free and imbalanced reference counters.

Leaving the HW in an active state can lead to DMA accesses to memory now
freed by the driver.

Both is a complete no-go for driver unload so completely revert the
workaround for now.

This reverts commit f5c7e7797060255dbc8160734ccc5ad6183c5e04.

Signed-off-by: Christian König <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7, v6.6-rc6, v6.6-rc5
# 53dd920c 05-Oct-2023 Asad Kamal <[email protected]>

drm/amdgpu : Add hive ras recovery check

If one of the devices in the hive detects a
fatal error, need to send ras recovery reset
message to PMFW of all devices in the hive.
For that add a flag in h

drm/amdgpu : Add hive ras recovery check

If one of the devices in the hive detects a
fatal error, need to send ras recovery reset
message to PMFW of all devices in the hive.
For that add a flag in hive to indicate that
it's undergoing ras recovery

Signed-off-by: Asad Kamal <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Reviewed-by: Lijo Lazar <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.6-rc4, v6.6-rc3, v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7, v6.4-rc6, v6.4-rc5, v6.4-rc4, v6.4-rc3, v6.4-rc2, v6.4-rc1, v6.3, v6.3-rc7, v6.3-rc6, v6.3-rc5, v6.3-rc4, v6.3-rc3, v6.3-rc2, v6.3-rc1
# da9d669e 04-Mar-2023 Hawking Zhang <[email protected]>

drm/amdgpu: Rework xgmi_wafl_pcs ras sw_init

To align with other IP blocks.

Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Stanley Yang <[email protected]>
Reviewed-by: Tao Zh

drm/amdgpu: Rework xgmi_wafl_pcs ras sw_init

To align with other IP blocks.

Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Stanley Yang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.2, v6.2-rc8, v6.2-rc7, v6.2-rc6, v6.2-rc5, v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1, v6.1, v6.1-rc8, v6.1-rc7, v6.1-rc6, v6.1-rc5, v6.1-rc4, v6.1-rc3, v6.1-rc2, v6.1-rc1, v6.0, v6.0-rc7, v6.0-rc6, v6.0-rc5
# f5c7e779 07-Sep-2022 YiPeng Chai <[email protected]>

drm/amdgpu: Adjust removal control flow for smu v13_0_2

Adjust removal control flow for smu v13_0_2:
During amdgpu uninstallation, when removing the first
device, the kernel needs to first send a

drm/amdgpu: Adjust removal control flow for smu v13_0_2

Adjust removal control flow for smu v13_0_2:
During amdgpu uninstallation, when removing the first
device, the kernel needs to first send a mode1reset message
to all gpu devices. Otherwise, smu initialization will fail
the next time amdgpu is installed.

V2:
1. Update commit comments.
2. Remove the global variable amdgpu_device_remove_cnt
and add a variable to the structure amdgpu_hive_info.
3. Use hive to detect the first removed device instead of
a global variable.

V3:
1. Update commit comments.
2. Split a patch into multiple patches.
3. The current patch does:
a. Add a work mode of AMDGPU_RESET_FOR_DEVICE_REMOVE into
the existing gpu recover path, which make all devices
in hive list only have HW reset but no resume (except
the base IP).
b. Call AMDGPU_RESET_FOR_DEVICE_REMOVE and
AMDGPU_NEED_FULL_RESET mode of amdgpu_device_gpu_recover
in amdgpu_pci_remove when removing the first device in
hive list.
c. When removing the first device, the IP blocks keyword
function call sequence is as follows:
.suspend->mode1reset->.resume(basic ip)->.hw_fini->.early_fini->.sw_fini.
^ |
|-<----------<---------<----|
The first three sequences are because of a call to
amdgpu_device_gpu_recover. The three sequences will be
executed in a loop until all devices in the hive list
are iterated.
The sequences starting from .hw_fini only apply to the
first device. Since .suspend has been called before,
except the resumed phase1 basic ip blocks, all other ip
blocks .hw_fini of current device will do nothing.
d. When removing other devices, the calling sequences is the
same as legacy:
.hw_fini -> .early_fini -> .sw_fini.
Since .suspend has been called when removing the first device,
except the resumed phase1 basic ip blocks, all of other ip
blocks .hw_fini of current device will do nothing.

Signed-off-by: YiPeng Chai <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v6.0-rc4, v6.0-rc3, v6.0-rc2, v6.0-rc1, v5.19, v5.19-rc8, v5.19-rc7, v5.19-rc6, v5.19-rc5, v5.19-rc4, v5.19-rc3, v5.19-rc2, v5.19-rc1, v5.18, v5.18-rc7, v5.18-rc6, v5.18-rc5, v5.18-rc4, v5.18-rc3, v5.18-rc2, v5.18-rc1, v5.17, v5.17-rc8, v5.17-rc7, v5.17-rc6
# 158a05a0 23-Feb-2022 Alex Sierra <[email protected]>

drm/amdgpu: Add use_xgmi_p2p module parameter

This parameter controls xGMI p2p communication, which is enabled by
default. However, it can be disabled by setting it to 0. In case xGMI
p2p is disable

drm/amdgpu: Add use_xgmi_p2p module parameter

This parameter controls xGMI p2p communication, which is enabled by
default. However, it can be disabled by setting it to 0. In case xGMI
p2p is disabled in a dGPU, PCIe p2p interface will be used instead.
This parameter is ignored in GPUs that do not support xGMI
p2p configuration.

Signed-off-by: Alex Sierra <[email protected]>
Acked-by: Luben Tuikov <[email protected]>
Acked-by: Harish Kasiviswanathan <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v5.17-rc5, v5.17-rc4, v5.17-rc3, v5.17-rc2, v5.17-rc1
# cfbb6b00 21-Jan-2022 Andrey Grodzovsky <[email protected]>

drm/amdgpu: Rework reset domain to be refcounted.

The reset domain contains register access semaphor
now and so needs to be present as long as each device
in a hive needs it and so it cannot be bind

drm/amdgpu: Rework reset domain to be refcounted.

The reset domain contains register access semaphor
now and so needs to be present as long as each device
in a hive needs it and so it cannot be binded to XGMI
hive life cycle.
Adress this by making reset domain refcounted and pointed
by each member of the hive and the hive itself.

v4:

Fix crash on boot witrh XGMI hive by adding type to reset_domain.
XGMI will only create a new reset_domain if prevoius was of single
device type meaning it's first boot. Otherwsie it will take a
refocunt to exsiting reset_domain from the amdgou device.

Add a wrapper around reset_domain->refcount get/put
and a wrapper around send to reset wq (Lijo)

Signed-off-by: Andrey Grodzovsky <[email protected]>
Acked-by: Christian König <[email protected]>
Link: https://www.spinics.net/lists/amd-gfx/msg74121.html

show more ...


Revision tags: v5.16, v5.16-rc8, v5.16-rc7, v5.16-rc6
# 681260df 15-Dec-2021 Andrey Grodzovsky <[email protected]>

drm/amdgpu: Drop hive->in_reset

Since we serialize all resets no need to protect from concurrent
resets.

Signed-off-by: Andrey Grodzovsky <[email protected]>
Reviewed-by: Christian König <c

drm/amdgpu: Drop hive->in_reset

Since we serialize all resets no need to protect from concurrent
resets.

Signed-off-by: Andrey Grodzovsky <[email protected]>
Reviewed-by: Christian König <[email protected]>
Link: https://www.spinics.net/lists/amd-gfx/msg74115.html

show more ...


Revision tags: v5.16-rc5, v5.16-rc4
# a4c63caf 30-Nov-2021 Andrey Grodzovsky <[email protected]>

drm/amdgpu: Introduce reset domain

Defined a reset_domain struct such that
all the entities that go through reset
together will be serialized one against
another. Do it for both single device and
XG

drm/amdgpu: Introduce reset domain

Defined a reset_domain struct such that
all the entities that go through reset
together will be serialized one against
another. Do it for both single device and
XGMI hive cases.

Signed-off-by: Andrey Grodzovsky <[email protected]>
Suggested-by: Daniel Vetter <[email protected]>
Suggested-by: Christian König <[email protected]>
Reviewed-by: Christian König <[email protected]>
Link: https://www.spinics.net/lists/amd-gfx/msg74111.html

show more ...


# 6c245386 04-Jan-2022 yipechai <[email protected]>

drm/amdgpu: Modify xgmi block to fit for the unified ras block data and ops

1.Modify gmc block to fit for the unified ras block data and ops.
2.Change amdgpu_xgmi_ras_funcs to amdgpu_xgmi_ras, and t

drm/amdgpu: Modify xgmi block to fit for the unified ras block data and ops

1.Modify gmc block to fit for the unified ras block data and ops.
2.Change amdgpu_xgmi_ras_funcs to amdgpu_xgmi_ras, and the corresponding variable name remove _funcs suffix.
3.Remove the const flag of gmc ras variable so that gmc ras block can be able to be inserted into amdgpu device ras block link list.
4.Invoke amdgpu_ras_register_ras_block function to register gmc ras block into amdgpu device ras block link list.
5.Remove the redundant code about gmc in amdgpu_ras.c after using the unified ras block.

Signed-off-by: yipechai <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Reviewed-by: John Clements <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v5.16-rc3, v5.16-rc2, v5.16-rc1, v5.15, v5.15-rc7, v5.15-rc6, v5.15-rc5, v5.15-rc4, v5.15-rc3, v5.15-rc2, v5.15-rc1, v5.14, v5.14-rc7, v5.14-rc6, v5.14-rc5, v5.14-rc4, v5.14-rc3, v5.14-rc2, v5.14-rc1, v5.13, v5.13-rc7, v5.13-rc6, v5.13-rc5, v5.13-rc4, v5.13-rc3, v5.13-rc2
# 3f46c4e9 12-May-2021 Jonathan Kim <[email protected]>

drm/amdkfd: report xgmi bandwidth between direct peers to the kfd

Report the min/max bandwidth in megabytes to the kfd for direct
xgmi connections only. Indirect peers will report 0 since
indirect

drm/amdkfd: report xgmi bandwidth between direct peers to the kfd

Report the min/max bandwidth in megabytes to the kfd for direct
xgmi connections only. Indirect peers will report 0 since
indirect route is unknown.

Signed-off-by: Jonathan Kim <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v5.13-rc1, v5.12, v5.12-rc8, v5.12-rc7, v5.12-rc6, v5.12-rc5, v5.12-rc4
# 52137ca8 18-Mar-2021 Hawking Zhang <[email protected]>

drm/amdgpu: move xgmi ras functions to xgmi_ras_funcs

xgmi ras is not managed by gpu driver when gpu is
connected to cpu through xgmi. move all xgmi ras
functions to xgmi_ras_funcs so gpu driver onl

drm/amdgpu: move xgmi ras functions to xgmi_ras_funcs

xgmi ras is not managed by gpu driver when gpu is
connected to cpu through xgmi. move all xgmi ras
functions to xgmi_ras_funcs so gpu driver only
initializes xgmi ras functions when it manages
xgmi ras.

Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Dennis Li <[email protected]>
Reviewed-by: John Clements <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v5.12-rc3, v5.12-rc2, v5.12-rc1, v5.12-rc1-dontuse, v5.11, v5.11-rc7, v5.11-rc6, v5.11-rc5, v5.11-rc4, v5.11-rc3, v5.11-rc2, v5.11-rc1, v5.10, v5.10-rc7, v5.10-rc6, v5.10-rc5, v5.10-rc4, v5.10-rc3, v5.10-rc2, v5.10-rc1, v5.9, v5.9-rc8, v5.9-rc7, v5.9-rc6, v5.9-rc5, v5.9-rc4, v5.9-rc3, v5.9-rc2
# d95e8e97 18-Aug-2020 Dennis Li <[email protected]>

drm/amdgpu: refine create and release logic of hive info

Change to dynamically create and release hive info object,
which help driver support more hives in the future.

v2:
Change to save hive objec

drm/amdgpu: refine create and release logic of hive info

Change to dynamically create and release hive info object,
which help driver support more hives in the future.

v2:
Change to save hive object pointer in adev, to avoid locking
xgmi_mutex every time when calling amdgpu_get_xgmi_hive.

v3:
1. Change type of hive object pointer in adev from void* to
amdgpu_hive_info*.
2. remove unnecessary variable initialization.

Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Dennis Li <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


# 53b3f8f4 19-Aug-2020 Dennis Li <[email protected]>

drm/amdgpu: refine codes to avoid reentering GPU recovery

if other threads have holden the reset lock, recovery will
fail to try_lock. Therefore we introduce atomic hive->in_reset
and adev->in_gpu_r

drm/amdgpu: refine codes to avoid reentering GPU recovery

if other threads have holden the reset lock, recovery will
fail to try_lock. Therefore we introduce atomic hive->in_reset
and adev->in_gpu_reset, to avoid reentering GPU recovery.

v2:
drop "? true : false" in the definition of amdgpu_in_reset

Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Dennis Li <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v5.9-rc1
# f1403342 12-Aug-2020 Christian König <[email protected]>

drm/amdgpu: revert "fix system hang issue during GPU reset"

The whole approach wasn't thought through till the end.

We already had a reset lock like this in the past and it caused the same problems

drm/amdgpu: revert "fix system hang issue during GPU reset"

The whole approach wasn't thought through till the end.

We already had a reset lock like this in the past and it caused the same problems like this one.

Completely revert the patch for now and add individual trylock protection to the hardware access functions as necessary.

This reverts commit df9c8d1aa278c435c30a69b8f2418b4a52fcb929.

Signed-off-by: Christian König <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


Revision tags: v5.8, v5.8-rc7, v5.8-rc6, v5.8-rc5
# df9c8d1a 08-Jul-2020 Dennis Li <[email protected]>

drm/amdgpu: fix system hang issue during GPU reset

when GPU hang, driver has multi-paths to enter amdgpu_device_gpu_recover,
the atomic adev->in_gpu_reset and hive->in_reset are used to avoid
re-ent

drm/amdgpu: fix system hang issue during GPU reset

when GPU hang, driver has multi-paths to enter amdgpu_device_gpu_recover,
the atomic adev->in_gpu_reset and hive->in_reset are used to avoid
re-entering GPU recovery.

During GPU reset and resume, it is unsafe that other threads access GPU,
which maybe cause GPU reset failed. Therefore the new rw_semaphore
adev->reset_sem is introduced, which protect GPU from being accessed by
external threads during recovery.

v2:
1. add rwlock for some ioctls, debugfs and file-close function.
2. change to use dqm->is_resetting and dqm_lock for protection in kfd
driver.
3. remove try_lock and change adev->in_gpu_reset as atomic, to avoid
re-enter GPU recovery for the same GPU hang.

v3:
1. change back to use adev->reset_sem to protect kfd callback
functions, because dqm_lock couldn't protect all codes, for example:
free_mqd must be called outside of dqm_lock;

[ 1230.176199] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019
[ 1230.177221] Call Trace:
[ 1230.178249] dump_stack+0x98/0xd5
[ 1230.179443] amdgpu_virt_kiq_reg_write_reg_wait+0x181/0x190 [amdgpu]
[ 1230.180673] gmc_v9_0_flush_gpu_tlb+0xcc/0x310 [amdgpu]
[ 1230.181882] amdgpu_gart_unbind+0xa9/0xe0 [amdgpu]
[ 1230.183098] amdgpu_ttm_backend_unbind+0x46/0x180 [amdgpu]
[ 1230.184239] ? ttm_bo_put+0x171/0x5f0 [ttm]
[ 1230.185394] ttm_tt_unbind+0x21/0x40 [ttm]
[ 1230.186558] ttm_tt_destroy.part.12+0x12/0x60 [ttm]
[ 1230.187707] ttm_tt_destroy+0x13/0x20 [ttm]
[ 1230.188832] ttm_bo_cleanup_memtype_use+0x36/0x80 [ttm]
[ 1230.189979] ttm_bo_put+0x1be/0x5f0 [ttm]
[ 1230.191230] amdgpu_bo_unref+0x1e/0x30 [amdgpu]
[ 1230.192522] amdgpu_amdkfd_free_gtt_mem+0xaf/0x140 [amdgpu]
[ 1230.193833] free_mqd+0x25/0x40 [amdgpu]
[ 1230.195143] destroy_queue_cpsch+0x1a7/0x270 [amdgpu]
[ 1230.196475] pqm_destroy_queue+0x105/0x260 [amdgpu]
[ 1230.197819] kfd_ioctl_destroy_queue+0x37/0x70 [amdgpu]
[ 1230.199154] kfd_ioctl+0x277/0x500 [amdgpu]
[ 1230.200458] ? kfd_ioctl_get_clock_counters+0x60/0x60 [amdgpu]
[ 1230.201656] ? tomoyo_file_ioctl+0x19/0x20
[ 1230.202831] ksys_ioctl+0x98/0xb0
[ 1230.204004] __x64_sys_ioctl+0x1a/0x20
[ 1230.205174] do_syscall_64+0x5f/0x250
[ 1230.206339] entry_SYSCALL_64_after_hwframe+0x49/0xbe

2. remove try_lock and introduce atomic hive->in_reset, to avoid
re-enter GPU recovery.

v4:
1. remove an unnecessary whitespace change in kfd_chardev.c
2. remove comment codes in amdgpu_device.c
3. add more detailed comment in commit message
4. define a wrap function amdgpu_in_reset

v5:
1. Fix some style issues.

Reviewed-by: Hawking Zhang <[email protected]>
Suggested-by: Andrey Grodzovsky <[email protected]>
Suggested-by: Christian König <[email protected]>
Suggested-by: Felix Kuehling <[email protected]>
Suggested-by: Lijo Lazar <[email protected]>
Suggested-by: Luben Tukov <[email protected]>
Signed-off-by: Dennis Li <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

show more ...


12