History log of /linux-6.15/arch/powerpc/kernel/eeh_driver.c (Results 1 – 25 of 120)
Revision (<<< Hide revision tags) (Show revision tags >>>) Date Author Comments
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6
# d1679b4f 22-Apr-2024 Ganesh Goudar <[email protected]>

powerpc/eeh: Permanently disable the removed device

When a device is hot removed on powernv, the hotplug driver clears
the device's state. However, on pseries, if a device is removed by
phyp after r

powerpc/eeh: Permanently disable the removed device

When a device is hot removed on powernv, the hotplug driver clears
the device's state. However, on pseries, if a device is removed by
phyp after reaching the error threshold, the kernel remains unaware,
leading to the device not being torn down. This prevents necessary
remediation actions like failover.

Permanently disable the device if the presence check fails.

Also, in eeh_dev_check_failure in we may consider the error as false
positive if the device is hotpluged out as the get_state call returns
EEH_STATE_NOT_SUPPORT and we may end up not clearing the device state,
so log the event if the state is not moved to permanent failure state.

Signed-off-by: Ganesh Goudar <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://msgid.link/[email protected]

show more ...


Revision tags: v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7, v6.6-rc6
# 82f63524 11-Oct-2023 Benjamin Gray <[email protected]>

powerpc/eeh: Remove unnecessary cast

Sparse reports a warning when casting to an int. There is no need to
cast in the first place, so drop them.

Signed-off-by: Benjamin Gray <[email protected]>
S

powerpc/eeh: Remove unnecessary cast

Sparse reports a warning when casting to an int. There is no need to
cast in the first place, so drop them.

Signed-off-by: Benjamin Gray <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://msgid.link/[email protected]

show more ...


Revision tags: v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7, v6.4-rc6, v6.4-rc5, v6.4-rc4, v6.4-rc3, v6.4-rc2, v6.4-rc1, v6.3, v6.3-rc7, v6.3-rc6, v6.3-rc5, v6.3-rc4, v6.3-rc3, v6.3-rc2, v6.3-rc1, v6.2, v6.2-rc8
# 9efcdaac 09-Feb-2023 Ganesh Goudar <[email protected]>

powerpc/eeh: Set channel state after notifying the drivers

When a PCI error is encountered 6th time in an hour we
set the channel state to perm_failure and notify the
driver about the permanent fail

powerpc/eeh: Set channel state after notifying the drivers

When a PCI error is encountered 6th time in an hour we
set the channel state to perm_failure and notify the
driver about the permanent failure.

However, after upstream commit 38ddc011478e ("powerpc/eeh:
Make permanently failed devices non-actionable"), EEH handler
stops calling any routine once the device is marked as
permanent failure. This issue can lead to fatal consequences
like kernel hang with certain PCI devices.

Following log is observed with lpfc driver, with and without
this change, Without this change kernel hangs, If PCI error
is encountered 6 times for a device in an hour.

Without the change

EEH: Beginning: 'error_detected(permanent failure)'
PCI 0132:60:00.0#600000: EEH: not actionable (1,1,1)
PCI 0132:60:00.1#600000: EEH: not actionable (1,1,1)
EEH: Finished:'error_detected(permanent failure)'

With the change

EEH: Beginning: 'error_detected(permanent failure)'
EEH: Invoking lpfc->error_detected(permanent failure)
EEH: lpfc driver reports: 'disconnect'
EEH: Invoking lpfc->error_detected(permanent failure)
EEH: lpfc driver reports: 'disconnect'
EEH: Finished:'error_detected(permanent failure)'

To fix the issue, set channel state to permanent failure after
notifying the drivers.

Fixes: 38ddc011478e ("powerpc/eeh: Make permanently failed devices non-actionable")
Suggested-by: Mahesh Salgaonkar <[email protected]>
Signed-off-by: Ganesh Goudar <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

show more ...


Revision tags: v6.2-rc7, v6.2-rc6, v6.2-rc5, v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1, v6.1, v6.1-rc8, v6.1-rc7, v6.1-rc6, v6.1-rc5, v6.1-rc4, v6.1-rc3, v6.1-rc2, v6.1-rc1, v6.0, v6.0-rc7, v6.0-rc6, v6.0-rc5, v6.0-rc4, v6.0-rc3, v6.0-rc2, v6.0-rc1, v5.19, v5.19-rc8
# 2b461880 18-Jul-2022 Michael Ellerman <[email protected]>

powerpc: Fix all occurences of duplicate words

Since commit 87c78b612f4f ("powerpc: Fix all occurences of "the the"")
fixed "the the", there's now a steady stream of patches fixing other
duplicate w

powerpc: Fix all occurences of duplicate words

Since commit 87c78b612f4f ("powerpc: Fix all occurences of "the the"")
fixed "the the", there's now a steady stream of patches fixing other
duplicate words.

Just fix them all at once, to save the overhead of dealing with
individual patches for each case.

This leaves a few cases of "that that", which in some contexts is
correct.

Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

show more ...


Revision tags: v5.19-rc7, v5.19-rc6, v5.19-rc5, v5.19-rc4, v5.19-rc3, v5.19-rc2, v5.19-rc1, v5.18, v5.18-rc7, v5.18-rc6, v5.18-rc5, v5.18-rc4, v5.18-rc3, v5.18-rc2, v5.18-rc1, v5.17, v5.17-rc8
# 86c38fec 08-Mar-2022 Christophe Leroy <[email protected]>

powerpc: Remove asm/prom.h from all files that don't need it

Several files include asm/prom.h for no reason.

Clean it up.

Signed-off-by: Christophe Leroy <[email protected]>
[mpe: Drop c

powerpc: Remove asm/prom.h from all files that don't need it

Several files include asm/prom.h for no reason.

Clean it up.

Signed-off-by: Christophe Leroy <[email protected]>
[mpe: Drop change to prom_parse.c as reported by [email protected]]
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/7c9b8fda63dcf63e1b28f43e7ebdb95182cbc286.1646767214.git.christophe.leroy@csgroup.eu

show more ...


Revision tags: v5.17-rc7, v5.17-rc6, v5.17-rc5, v5.17-rc4, v5.17-rc3, v5.17-rc2, v5.17-rc1, v5.16, v5.16-rc8, v5.16-rc7, v5.16-rc6, v5.16-rc5, v5.16-rc4, v5.16-rc3, v5.16-rc2, v5.16-rc1, v5.15, v5.15-rc7, v5.15-rc6
# 157616f3 15-Oct-2021 Oliver O'Halloran <[email protected]>

powerpc/eeh: Use a goto for recovery failures

The EEH recovery logic in eeh_handle_normal_event() has some pretty strange
flow control. If we remove all the actual recovery logic we're left with
the

powerpc/eeh: Use a goto for recovery failures

The EEH recovery logic in eeh_handle_normal_event() has some pretty strange
flow control. If we remove all the actual recovery logic we're left with
the following skeleton:

if (result != PCI_ERS_RESULT_DISCONNECT) {
...
}

if (result != PCI_ERS_RESULT_DISCONNECT) {
...
}

if (result == PCI_ERS_RESULT_NONE) {
...
}

if (result == PCI_ERS_RESULT_CAN_RECOVER) {
...
}

if (result == PCI_ERS_RESULT_CAN_RECOVER) {
...
}

if (result == PCI_ERS_RESULT_NEED_RESET) {
...
}

if ((result == PCI_ERS_RESULT_RECOVERED) ||
(result == PCI_ERS_RESULT_NONE)) {
...
goto out;
}

/*
* unsuccessful recovery / PCI_ERS_RESULT_DISCONECTED
* handling is here.
*/
...

out:
...

Most of the "if () { ... }" blocks above change "result" to
PCI_ERS_RESULT_DISCONNECTED if an error occurs in that recovery step. This
makes the control flow a bit confusing since it breaks the early-exit
pattern that is generally used in Linux. In any case we end up handling the
error in the final else block so why not just jump there directly? Doing so
also allows us to de-indent a bunch of code.

No functional changes.

[dja: rebase on top of linux-next + my preceeding refactor,
move clearing the EEH_DEV_NO_HANDLER bit above the first goto so that
it is always clear in the error handler code as it was before.]

Signed-off-by: Oliver O'Halloran <[email protected]>
Signed-off-by: Daniel Axtens <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

show more ...


# 10b34ece 15-Oct-2021 Daniel Axtens <[email protected]>

powerpc/eeh: Small refactor of eeh_handle_normal_event()

The control flow of eeh_handle_normal_event() is a bit tricky.

Break out one of the error handling paths - rather than be in an else
block,

powerpc/eeh: Small refactor of eeh_handle_normal_event()

The control flow of eeh_handle_normal_event() is a bit tricky.

Break out one of the error handling paths - rather than be in an else
block, we'll make it part of the regular body of the function and put a
'goto out;' in the true limb of the if.

Signed-off-by: Daniel Axtens <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

show more ...


# 4141127c 12-Oct-2021 Uwe Kleine-König <[email protected]>

powerpc/eeh: Use to_pci_driver() instead of pci_dev->driver

Struct pci_driver contains a struct device_driver, so for PCI devices, it's
easy to convert a device_driver * to a pci_driver * with to_pc

powerpc/eeh: Use to_pci_driver() instead of pci_dev->driver

Struct pci_driver contains a struct device_driver, so for PCI devices, it's
easy to convert a device_driver * to a pci_driver * with to_pci_driver().
The device_driver * is in struct device, so we don't need to also keep
track of the pci_driver * in struct pci_dev.

Replace pdev->driver with to_pci_driver(). This is a step toward removing
pci_dev->driver.

[bhelgaas: split to separate patch]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Uwe Kleine-König <[email protected]>
Signed-off-by: Bjorn Helgaas <[email protected]>

show more ...


Revision tags: v5.15-rc5, v5.15-rc4, v5.15-rc3, v5.15-rc2, v5.15-rc1, v5.14, v5.14-rc7, v5.14-rc6, v5.14-rc5, v5.14-rc4, v5.14-rc3, v5.14-rc2, v5.14-rc1, v5.13, v5.13-rc7, v5.13-rc6, v5.13-rc5, v5.13-rc4, v5.13-rc3, v5.13-rc2, v5.13-rc1, v5.12, v5.12-rc8, v5.12-rc7, v5.12-rc6, v5.12-rc5, v5.12-rc4, v5.12-rc3, v5.12-rc2, v5.12-rc1, v5.12-rc1-dontuse, v5.11, v5.11-rc7, v5.11-rc6, v5.11-rc5, v5.11-rc4, v5.11-rc3, v5.11-rc2, v5.11-rc1, v5.10, v5.10-rc7, v5.10-rc6, v5.10-rc5, v5.10-rc4, v5.10-rc3, v5.10-rc2, v5.10-rc1, v5.9, v5.9-rc8, v5.9-rc7, v5.9-rc6, v5.9-rc5, v5.9-rc4, v5.9-rc3, v5.9-rc2, v5.9-rc1, v5.8, v5.8-rc7
# d923ab7a 25-Jul-2020 Oliver O'Halloran <[email protected]>

powerpc/eeh: Rename eeh_{add_to|remove_from}_parent_pe()

The naming of eeh_{add_to|remove_from}_parent_pe() doesn't really reflect
what they actually do. If the PE referred to be edev->pe_config_add

powerpc/eeh: Rename eeh_{add_to|remove_from}_parent_pe()

The naming of eeh_{add_to|remove_from}_parent_pe() doesn't really reflect
what they actually do. If the PE referred to be edev->pe_config_addr
already exists under that PHB then the edev is added to that PE. However,
if the PE doesn't exist the a new one is created for the edev.

The bulk of the implementation of eeh_add_to_parent_pe() covers that
second case. Similarly, most of eeh_remove_from_parent_pe() is
determining when it's safe to delete a PE.

Signed-off-by: Oliver O'Halloran <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

show more ...


# 8225d543 25-Jul-2020 Oliver O'Halloran <[email protected]>

powerpc/eeh: Pass eeh_dev to eeh_ops->resume_notify()

Mechanical conversion of the eeh_ops interfaces to use eeh_dev to reference
a specific device rather than pci_dn. No functional changes.

Signed

powerpc/eeh: Pass eeh_dev to eeh_ops->resume_notify()

Mechanical conversion of the eeh_ops interfaces to use eeh_dev to reference
a specific device rather than pci_dn. No functional changes.

Signed-off-by: Oliver O'Halloran <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

show more ...


# dffa9153 25-Jul-2020 Oliver O'Halloran <[email protected]>

powerpc/eeh: Move vf_index out of pci_dn and into eeh_dev

Drivers that do not support the PCI error handling callbacks are handled by
tearing down the device and re-probing them. If the device being

powerpc/eeh: Move vf_index out of pci_dn and into eeh_dev

Drivers that do not support the PCI error handling callbacks are handled by
tearing down the device and re-probing them. If the device being removed is
a virtual function then we need to know the VF index so it can be removed
using the pci_iov_{add|remove}_virtfn() API.

Currently this is handled by looking up the pci_dn, and using the vf_index
that was stashed there when the pci_dn for the VF was created in
pcibios_sriov_enable(). We would like to eliminate the use of pci_dn
outside of pseries though so we need to provide the generic EEH code with
some other way to find the vf_index.

The easiest thing to do here is move the vf_index field out of pci_dn and
into eeh_dev. Currently pci_dn and eeh_dev are allocated and initialized
together so this is a fairly minimal change in preparation for splitting
pci_dn and eeh_dev in the future.

Signed-off-by: Oliver O'Halloran <[email protected]>
Reviewed-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

show more ...


Revision tags: v5.8-rc6, v5.8-rc5, v5.8-rc4
# 16d79cd4 02-Jul-2020 Luc Van Oostenryck <[email protected]>

PCI: Use 'pci_channel_state_t' instead of 'enum pci_channel_state'

The method struct pci_error_handlers.error_detected() is defined and
documented as taking an 'enum pci_channel_state' for the secon

PCI: Use 'pci_channel_state_t' instead of 'enum pci_channel_state'

The method struct pci_error_handlers.error_detected() is defined and
documented as taking an 'enum pci_channel_state' for the second argument,
but most drivers use 'pci_channel_state_t' instead.

This 'pci_channel_state_t' is not a typedef for the enum but a typedef for
a bitwise type in order to have better/stricter typechecking.

Consolidate everything by using 'pci_channel_state_t' in the method's
definition, in the related helpers and in the drivers.

Enforce use of 'pci_channel_state_t' by replacing 'enum pci_channel_state'
with an anonymous 'enum'.

Note: Currently, from a typechecking point of view this patch changes
nothing because only the constants defined by the enum are bitwise, not the
enum itself (sparse doesn't have the notion of 'bitwise enum'). This may
change in some not too far future, hence the patch.

[bhelgaas: squash in
https://lore.kernel.org/r/[email protected]
https://lore.kernel.org/r/[email protected]]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Luc Van Oostenryck <[email protected]>
Signed-off-by: Bjorn Helgaas <[email protected]>

show more ...


Revision tags: v5.8-rc3, v5.8-rc2, v5.8-rc1, v5.7, v5.7-rc7, v5.7-rc6, v5.7-rc5, v5.7-rc4, v5.7-rc3, v5.7-rc2, v5.7-rc1, v5.6, v5.6-rc7, v5.6-rc6, v5.6-rc5, v5.6-rc4, v5.6-rc3, v5.6-rc2, v5.6-rc1
# d4f194ed 07-Feb-2020 Sam Bobroff <[email protected]>

powerpc/eeh: Fix deadlock handling dead PHB

Recovering a dead PHB can currently cause a deadlock as the PCI
rescan/remove lock is taken twice.

This is caused as part of an existing bug in
eeh_handl

powerpc/eeh: Fix deadlock handling dead PHB

Recovering a dead PHB can currently cause a deadlock as the PCI
rescan/remove lock is taken twice.

This is caused as part of an existing bug in
eeh_handle_special_event(). The pe is processed while traversing the
PHBs even though the pe is unrelated to the loop. This causes the pe
to be, incorrectly, processed more than once.

Untangling this section can move the pe processing out of the loop and
also outside the locked section, correcting both problems.

Fixes: 2e25505147b8 ("powerpc/eeh: Fix crash when edev->pdev changes")
Cc: [email protected] # 5.4+
Signed-off-by: Sam Bobroff <[email protected]>
Reviewed-by: Frederic Barrat <[email protected]>
Tested-by: Frederic Barrat <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/0547e82dbf90ee0729a2979a8cac5c91665c621f.1581051445.git.sbobroff@linux.ibm.com

show more ...


Revision tags: v5.5, v5.5-rc7, v5.5-rc6, v5.5-rc5, v5.5-rc4, v5.5-rc3, v5.5-rc2, v5.5-rc1, v5.4, v5.4-rc8, v5.4-rc7, v5.4-rc6
# 3b5b9997 28-Oct-2019 Oliver O'Halloran <[email protected]>

powerpc/powernv/iov: Ensure the pdn for VFs always contains a valid PE number

On pseries there is a bug with adding hotplugged devices to an IOMMU
group. For a number of dumb reasons fixing that bug

powerpc/powernv/iov: Ensure the pdn for VFs always contains a valid PE number

On pseries there is a bug with adding hotplugged devices to an IOMMU
group. For a number of dumb reasons fixing that bug first requires
re-working how VFs are configured on PowerNV. For background, on
PowerNV we use the pcibios_sriov_enable() hook to do two things:

1. Create a pci_dn structure for each of the VFs, and
2. Configure the PHB's internal BARs so the MMIO range for each VF
maps to a unique PE.

Roughly speaking a PE is the hardware counterpart to a Linux IOMMU
group since all the devices in a PE share the same IOMMU table. A PE
also defines the set of devices that should be isolated in response to
a PCI error (i.e. bad DMA, UR/CA, AER events, etc). When isolated all
MMIO and DMA traffic to and from devicein the PE is blocked by the
root complex until the PE is recovered by the OS.

The requirement to block MMIO causes a giant headache because the P8
PHB generally uses a fixed mapping between MMIO addresses and PEs. As
a result we need to delay configuring the IOMMU groups for device
until after MMIO resources are assigned. For physical devices (i.e.
non-VFs) the PE assignment is done in pcibios_setup_bridge() which is
called immediately after the MMIO resources for downstream
devices (and the bridge's windows) are assigned. For VFs the setup is
more complicated because:

a) pcibios_setup_bridge() is not called again when VFs are activated, and
b) The pci_dev for VFs are created by generic code which runs after
pcibios_sriov_enable() is called.

The work around for this is a two step process:

1. A fixup in pcibios_add_device() is used to initialised the cached
pe_number in pci_dn, then
2. A bus notifier then adds the device to the IOMMU group for the PE
specified in pci_dn->pe_number.

A side effect fixing the pseries bug mentioned in the first paragraph
is moving the fixup out of pcibios_add_device() and into
pcibios_bus_add_device(), which is called much later. This results in
step 2. failing because pci_dn->pe_number won't be initialised when
the bus notifier is run.

We can fix this by removing the need for the fixup. The PE for a VF is
known before the VF is even scanned so we can initialise
pci_dn->pe_number pcibios_sriov_enable() instead. Unfortunately,
moving the initialisation causes two problems:

1. We trip the WARN_ON() in the current fixup code, and
2. The EEH core clears pdn->pe_number when recovering a VF and
relies on the fixup to correctly re-set it.

The only justification for either of these is a comment in
eeh_rmv_device() suggesting that pdn->pe_number *must* be set to
IODA_INVALID_PE in order for the VF to be scanned. However, this
comment appears to have no basis in reality. Both bugs can be fixed by
just deleting the code.

Tested-by: Alexey Kardashevskiy <[email protected]>
Reviewed-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Oliver O'Halloran <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

show more ...


Revision tags: v5.4-rc5, v5.4-rc4
# de84ffc3 17-Oct-2019 Sam Bobroff <[email protected]>

powerpc/eeh: differentiate duplicate detection message

Currently when an EEH error is detected, the system log receives the
same (or almost the same) message twice:

EEH: PHB#0 failure detected, l

powerpc/eeh: differentiate duplicate detection message

Currently when an EEH error is detected, the system log receives the
same (or almost the same) message twice:

EEH: PHB#0 failure detected, location: N/A
EEH: PHB#0 failure detected, location: N/A
or
EEH: eeh_dev_check_failure: Frozen PHB#0-PE#0 detected
EEH: Frozen PHB#0-PE#0 detected

This looks like a bug, but in fact the messages are from different
functions and mean slightly different things. So keep both but change
one of the messages slightly, so that it's clear they are different:

EEH: PHB#0 failure detected, location: N/A
EEH: Recovering PHB#0, location: N/A
or
EEH: eeh_dev_check_failure: Frozen PHB#0-PE#0 detected
EEH: Recovering PHB#0-PE#0

Signed-off-by: Sam Bobroff <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/43817cb6e6631b0828b9a6e266f60d1f8ca8eb22.1571288375.git.sbobroff@linux.ibm.com

show more ...


Revision tags: v5.4-rc3, v5.4-rc2, v5.4-rc1, v5.3, v5.3-rc8, v5.3-rc7
# bbbd7f11 28-Aug-2019 Thomas Huth <[email protected]>

powerpc: Replace GPL boilerplate with SPDX identifiers

The FSF does not reside in "675 Mass Ave, Cambridge" anymore...
let's simply use proper SPDX identifiers instead.

Signed-off-by: Thomas Huth <

powerpc: Replace GPL boilerplate with SPDX identifiers

The FSF does not reside in "675 Mass Ave, Cambridge" anymore...
let's simply use proper SPDX identifiers instead.

Signed-off-by: Thomas Huth <[email protected]>
Acked-by: Russell Currey <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

show more ...


# 1b7f3b6c 13-Sep-2019 Michael Ellerman <[email protected]>

powerpc/eeh: Fix build with STACKTRACE=n

The build breaks when STACKTRACE=n, eg. skiroot_defconfig:

arch/powerpc/kernel/eeh_event.c:124:23: error: implicit declaration of function 'stack_trace_sa

powerpc/eeh: Fix build with STACKTRACE=n

The build breaks when STACKTRACE=n, eg. skiroot_defconfig:

arch/powerpc/kernel/eeh_event.c:124:23: error: implicit declaration of function 'stack_trace_save'

Fix it with some ifdefs for now.

Fixes: 25baf3d81614 ("powerpc/eeh: Defer printing stack trace")
Signed-off-by: Michael Ellerman <[email protected]>

show more ...


# aeff27c1 03-Sep-2019 Oliver O'Halloran <[email protected]>

powerpc/eeh: Set attention indicator while recovering

I am the RAS team. Hear me roar.

Roar.

On a more serious note, being able to locate failed devices can be helpful.
Set the attention indicator

powerpc/eeh: Set attention indicator while recovering

I am the RAS team. Hear me roar.

Roar.

On a more serious note, being able to locate failed devices can be helpful.
Set the attention indicator if the slot supports it once we've determined
the device is present and only clear it if the device is fully recovered.

Signed-off-by: Oliver O'Halloran <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

show more ...


# 25baf3d8 03-Sep-2019 Oliver O'Halloran <[email protected]>

powerpc/eeh: Defer printing stack trace

Currently we print a stack trace in the event handler to help with
debugging EEH issues. In the case of suprise hot-unplug this is unneeded,
so we want to pre

powerpc/eeh: Defer printing stack trace

Currently we print a stack trace in the event handler to help with
debugging EEH issues. In the case of suprise hot-unplug this is unneeded,
so we want to prevent printing the stack trace unless we know it's due to
an actual device error. To accomplish this, we can save a stack trace at
the point of detection and only print it once the EEH recovery handler has
determined the freeze was due to an actual error.

Since the whole point of this is to prevent spurious EEH output we also
move a few prints out of the detection thread, or mark them as pr_debug
so anyone interested can get output from the eeh_check_dev_failure()
if they want.

Signed-off-by: Oliver O'Halloran <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

show more ...


# b104af5a 03-Sep-2019 Oliver O'Halloran <[email protected]>

powerpc/eeh: Check slot presence state in eeh_handle_normal_event()

When a device is surprise removed while undergoing IO we will probably
get an EEH PE freeze due to MMIO timeouts and other errors.

powerpc/eeh: Check slot presence state in eeh_handle_normal_event()

When a device is surprise removed while undergoing IO we will probably
get an EEH PE freeze due to MMIO timeouts and other errors. When a freeze
is detected we send a recovery event to the EEH worker thread which will
notify drivers, and perform recovery as needed.

In the event of a hot-remove we don't want recovery to occur since there
isn't a device to recover. The recovery process is fairly long due to
the number of wait states (required by PCIe) which causes problems when
devices are removed and replaced (e.g. hot swapping of U.2 NVMe drives).

To determine if we need to skip the recovery process we can use the
get_adapter_state() operation of the hotplug_slot to determine if the
slot contains a device or not, and if the slot is empty we can skip
recovery entirely.

One thing to note is that the slot being EEH frozen does not prevent the
hotplug driver from working. We don't have the EEH recovery thread
remove any of the devices since it's assumed that the hotplug driver
will handle tearing down the slot state.

Signed-off-by: Oliver O'Halloran <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

show more ...


# 38ddc011 03-Sep-2019 Oliver O'Halloran <[email protected]>

powerpc/eeh: Make permanently failed devices non-actionable

If a device is torn down by a hotplug slot driver it's marked as removed
and marked as permaantly failed. There's no point in trying to re

powerpc/eeh: Make permanently failed devices non-actionable

If a device is torn down by a hotplug slot driver it's marked as removed
and marked as permaantly failed. There's no point in trying to recover a
permernantly failed device so it should be considered un-actionable.

Signed-off-by: Oliver O'Halloran <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

show more ...


# 799abe28 03-Sep-2019 Oliver O'Halloran <[email protected]>

powerpc/eeh: Clean up EEH PEs after recovery finishes

When the last device in an eeh_pe is removed the eeh_pe structure itself
(and any empty parents) are freed since they are no longer needed. This

powerpc/eeh: Clean up EEH PEs after recovery finishes

When the last device in an eeh_pe is removed the eeh_pe structure itself
(and any empty parents) are freed since they are no longer needed. This
results in a crash when a hotplug driver is involved since the following
may occur:

1. Device is suprise removed.
2. Driver performs an MMIO, which fails and queues and eeh_event.
3. Hotplug driver receives a hotplug interrupt and removes any
pci_devs that were under the slot.
4. pci_dev is torn down and the eeh_pe is freed.
5. The EEH event handler thread processes the eeh_event and crashes
since the eeh_pe pointer in the eeh_event structure is no
longer valid.

Crashing is generally considered poor form. Instead of doing that use
the fact PEs are marked as EEH_PE_INVALID to keep them around until the
end of the recovery cycle, at which point we can safely prune any empty
PEs.

Signed-off-by: Oliver O'Halloran <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

show more ...


Revision tags: v5.3-rc6, v5.3-rc5
# cef50c67 16-Aug-2019 Sam Bobroff <[email protected]>

powerpc/eeh: Remove unused return path from eeh_pe_dev_traverse()

There are no users of the early-out return value from
eeh_pe_dev_traverse(), so remove it.

Signed-off-by: Sam Bobroff <sbobroff@lin

powerpc/eeh: Remove unused return path from eeh_pe_dev_traverse()

There are no users of the early-out return value from
eeh_pe_dev_traverse(), so remove it.

Signed-off-by: Sam Bobroff <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/c648070f5b28fe8ca1880b48e64b267959ffd369.1565930772.git.sbobroff@linux.ibm.com

show more ...


# 2e255051 16-Aug-2019 Sam Bobroff <[email protected]>

powerpc/eeh: Fix crash when edev->pdev changes

If a PCI device is removed during eeh_pe_report_edev(), between the
calls to device_lock() and device_unlock(), edev->pdev will change and
cause a cras

powerpc/eeh: Fix crash when edev->pdev changes

If a PCI device is removed during eeh_pe_report_edev(), between the
calls to device_lock() and device_unlock(), edev->pdev will change and
cause a crash as the wrong mutex is released.

To correct this, hold the PCI rescan/remove lock while taking a copy
of edev->pdev and performing a get_device() on it. Use this value to
release the mutex, but also pass it through to the device driver's EEH
handlers so that they always see the same device.

Signed-off-by: Sam Bobroff <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/3c590579a0faa24d20c826dcd26c739eb4d454e6.1565930772.git.sbobroff@linux.ibm.com

show more ...


# 1ff8f36f 16-Aug-2019 Sam Bobroff <[email protected]>

powerpc/eeh: Convert log messages to eeh_edev_* macros

Convert existing messages, where appropriate, to use the eeh_edev_*
logging macros.

The only effect should be minor adjustments to the log mes

powerpc/eeh: Convert log messages to eeh_edev_* macros

Convert existing messages, where appropriate, to use the eeh_edev_*
logging macros.

The only effect should be minor adjustments to the log messages, apart
from:

- A new message in pseries_eeh_probe() "Probing device" to match the
powernv case.
- The "Probing device" message in pnv_eeh_probe() is now generated
slightly later, which will mean that it is no longer emitted for
devices that aren't probed due to the initial checks.

Signed-off-by: Sam Bobroff <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/ce505a0a7a4a5b0367f0f40f8b26e7c0a9cf4cb7.1565930772.git.sbobroff@linux.ibm.com

show more ...


12345