| 58029c39 | 27-Feb-2025 |
Yazen Ghannam <[email protected]> |
RAS/AMD/FMPM: Get masked address
Some operations require checking, or ignoring, specific bits in an address value. For example, address values may be compared to identify unique structures.
Currently, the full address value is compared when filtering for duplicates. This results in overcounting and the creation of extra records, which gives the impression that more unique events occurred than actually did.
Mask the address for physical rows on MI300.
[ bp: Simplify. ]
Fixes: 6f15e617cc99 ("RAS: Introduce a FRU memory poison manager")
Signed-off-by: Yazen Ghannam <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Cc: [email protected]
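Illustration (not part of the commit): a minimal userspace sketch of duplicate filtering on a masked address. Two addresses count as the same location when they agree on the bits that identify a physical row. The mask value and the `toy_` helper names are hypothetical and do not reflect the actual MI300 bit layout.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mask: assume bits 8..31 identify the physical row. */
#define TOY_ROW_MASK 0x00000000FFFFFF00ULL

/* Keep only the bits that identify the structure of interest. */
static uint64_t toy_masked_addr(uint64_t addr)
{
	return addr & TOY_ROW_MASK;
}

/* Two errors are duplicates when their masked addresses match. */
static bool toy_same_row(uint64_t a, uint64_t b)
{
	return toy_masked_addr(a) == toy_masked_addr(b);
}
```

Comparing full addresses would treat two errors in the same row but different low-order bits as distinct, which is the overcounting described above.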
|
| ba437905 | 07-Jun-2024 |
Yazen Ghannam <[email protected]> |
RAS/AMD/ATL: Use system settings for MI300 DRAM to normalized address translation
The currently used normalized address format is not applicable to all MI300 systems. This leads to incorrect results during address translation.
Drop the fixed layout and construct the normalized address from system settings.
Fixes: 87a612375307 ("RAS/AMD/ATL: Add MI300 DRAM to normalized address translation support")
Signed-off-by: Yazen Ghannam <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Cc: <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
| f4c0cd18 | 06-Jun-2024 |
John Allen <[email protected]> |
RAS/AMD/FMPM: Use atl internal.h for INVALID_SPA
Both the AMD ATL and the FMPM driver define INVALID_SPA. Include the definition from the ATL internal.h header in the FMPM driver.
Signed-off-by: John Allen <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
| e0372d69 | 06-Jun-2024 |
John Allen <[email protected]> |
RAS/AMD/ATL: Implement DF 4.5 NP2 denormalization
Unlike with previous Data Fabric versions, with Data Fabric 4.5 non-power-of-2 denormalization, there are bits of the system physical address that can't be fully reconstructed from the normalized address.
To determine the proper combination of missing system physical address bits, iterate through each possible combination of these bits, normalize the resulting system physical address, and compare to the original address that is being translated. If the addresses match, then the correct permutation of bits has been found.
Signed-off-by: John Allen <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Yazen Ghannam <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
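Illustration (not part of the commit): the search described above can be sketched with a toy interleaving scheme. The three-channel, 256-byte-block layout and all `toy_` names are assumptions made for this example; the real DF 4.5 address math is considerably more involved.

```c
#include <stdint.h>

#define TOY_INVALID_ADDR UINT64_MAX
#define TOY_NUM_CHANNELS 3u	/* hypothetical non-power-of-2 channel count */

struct toy_norm {
	uint64_t addr;		/* normalized (per-channel) address */
	unsigned int chan;	/* channel that owns the address */
};

/* Toy forward translation: interleave 256-byte blocks across 3 channels. */
static struct toy_norm toy_normalize(uint64_t spa)
{
	uint64_t block = spa >> 8;
	struct toy_norm n = {
		.addr = ((block / TOY_NUM_CHANNELS) << 8) | (spa & 0xff),
		.chan = (unsigned int)(block % TOY_NUM_CHANNELS),
	};

	return n;
}

/*
 * Toy reverse translation: rebuild the bits that can be recovered directly,
 * then try every combination of the two missing bits. A candidate is correct
 * when normalizing it gives back the address being translated.
 */
static uint64_t toy_denormalize(struct toy_norm target)
{
	uint64_t base = (((target.addr >> 8) * TOY_NUM_CHANNELS) << 8) |
			(target.addr & 0xff);
	uint64_t missing;

	for (missing = 0; missing < 4; missing++) {
		uint64_t candidate = base + (missing << 8);
		struct toy_norm n = toy_normalize(candidate);

		if (n.addr == target.addr && n.chan == target.chan)
			return candidate;
	}

	return TOY_INVALID_ADDR;
}
```

The forward translation acts as the check, so the reverse path needs no closed-form inverse for the missing bits.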
|
| 6cce048c | 06-Jun-2024 |
John Allen <[email protected]> |
RAS/AMD/ATL: Expand helpers for adding and removing base and hole
The ret_addr field in struct addr_ctx contains the intermediate value of the returned address as it passes through multiple steps in the translation process. Currently, adding the DRAM base and legacy hole is only done once, so it operates directly on the intermediate value.
However, for DF 4.5 non-power-of-2 denormalization, adding and removing the DRAM base and legacy hole needs to be done for multiple temporary address values. During this process, the intermediate value should not be lost so the ret_addr value can't be reused.
Update the existing 'add' helper to operate on an arbitrary address and introduce a new 'remove' helper to do the inverse operations.
Signed-off-by: John Allen <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Yazen Ghannam <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
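Illustration (not part of the commit): a sketch of an 'add' helper that takes an arbitrary address, paired with its inverse 'remove' helper. The `toy_ctx` structure and its fields are hypothetical stand-ins for the real translation context, and the hole handling is simplified.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_4GIB (1ULL << 32)

/* Hypothetical stand-in for the translation context. */
struct toy_ctx {
	uint64_t dram_base;	/* base address of the DRAM map */
	uint64_t hole_base;	/* start of the legacy hole, below 4 GiB */
	bool hole_en;		/* legacy hole enabled for this map */
};

/* Operates on the address passed in, not on a value stored in the context. */
static uint64_t toy_add_base_and_hole(const struct toy_ctx *ctx, uint64_t addr)
{
	addr += ctx->dram_base;

	if (ctx->hole_en && addr >= ctx->hole_base)
		addr += TOY_4GIB - ctx->hole_base;

	return addr;
}

/* Inverse of the helper above: undo the hole first, then the base. */
static uint64_t toy_remove_base_and_hole(const struct toy_ctx *ctx, uint64_t addr)
{
	if (ctx->hole_en && addr >= TOY_4GIB)
		addr -= TOY_4GIB - ctx->hole_base;

	return addr - ctx->dram_base;
}
```

Because both helpers return a new value, a caller can apply them to any number of temporary addresses while the intermediate result stays untouched.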
|
| 9b195439 | 19-Mar-2024 |
Yazen Ghannam <[email protected]> |
RAS/AMD/FMPM: Safely handle saved records of various sizes
Currently, the size of the locally cached FRU record structures is based on the module parameter "max_nr_entries".
This creates issues when restoring records if a user changes the parameter.
If the number of entries is reduced, then old, larger records will not be restored. The opportunity to take action on the saved data is missed. Also, new records will be created and written to storage, even as the old records remain in storage, resulting in wasted space.
If the number of entries is increased, then the length of the old, smaller records will not be adjusted. This causes a checksum failure which leads to the old record being cleared from storage. Again, this results in a missed opportunity for action on the saved data.
Allocate the temporary record with the maximum possible size based on the current maximum number of supported entries (255). This allows the ERST read operation to succeed if max_nr_entries has been increased.
Warn the user if a saved record exceeds the expected size and fail to load the module. This allows the user to adjust the module parameter without losing data or the opportunity to restore larger records.
Increase the size of a saved record up to the current max_rec_len. The checksum will be recalculated, and the updated record will be written to storage.
Fixes: 6f15e617cc99 ("RAS: Introduce a FRU memory poison manager")
Signed-off-by: Yazen Ghannam <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Tested-by: Muralidhara M K <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
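Illustration (not part of the commit): the three cases above can be sketched as a size check. The header and entry sizes and the `toy_` names are hypothetical; only the 255-entry maximum comes from the text.

```c
#include <stddef.h>

#define TOY_HDR_LEN	32u	/* hypothetical fixed header size */
#define TOY_ENTRY_LEN	24u	/* hypothetical size of one entry */
#define TOY_MAX_ENTRIES	255u	/* maximum supported entries, per the text */

enum toy_action {
	TOY_USE_AS_IS,
	TOY_GROW_AND_REWRITE,	/* extend the record, recompute the checksum */
	TOY_REFUSE_TO_LOAD,	/* warn and let the user adjust the parameter */
};

static size_t toy_rec_len(unsigned int nr_entries)
{
	return TOY_HDR_LEN + (size_t)nr_entries * TOY_ENTRY_LEN;
}

/* The temporary read buffer is sized for the largest possible record. */
static size_t toy_read_buf_len(void)
{
	return toy_rec_len(TOY_MAX_ENTRIES);
}

/* Decide what to do with a saved record of saved_len bytes. */
static enum toy_action toy_check_saved(size_t saved_len, unsigned int max_nr_entries)
{
	size_t max_rec_len = toy_rec_len(max_nr_entries);

	if (saved_len > max_rec_len)
		return TOY_REFUSE_TO_LOAD;
	if (saved_len < max_rec_len)
		return TOY_GROW_AND_REWRITE;

	return TOY_USE_AS_IS;
}
```

Sizing the read buffer independently of the module parameter is what lets a larger saved record be read at all before the size comparison is made.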
|
| bd17b7c3 | 06-Mar-2024 |
Dan Carpenter <[email protected]> |
RAS/AMD/FMPM: Fix off by one when unwinding on error
Decrement the index variable i before the first iteration when freeing the remaining elements on error. Depending on where this fails it could free something from one element beyond the end of the fru_records[] array.
[ bp: Massage commit message. ]
Fixes: 6f15e617cc99 ("RAS: Introduce a FRU memory poison manager")
Signed-off-by: Dan Carpenter <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
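Illustration (not part of the commit): a userspace sketch of the unwind pattern, using `malloc()`/`free()` in place of the kernel allocators. The array, its size, and the `fail_at` failure-injection parameter are hypothetical.

```c
#include <stdlib.h>

#define TOY_NR 4

static void *toy_records[TOY_NR];

/*
 * fail_at: index whose allocation is forced to fail, TOY_NR to simulate a
 * failure in a later step after the loop has finished, or -1 for success.
 */
static int toy_alloc_records(int fail_at)
{
	int i;

	for (i = 0; i < TOY_NR; i++) {
		toy_records[i] = (i == fail_at) ? NULL : malloc(16);
		if (!toy_records[i])
			goto out_free;
	}

	if (fail_at == TOY_NR)
		goto out_free;	/* here i == TOY_NR, one past the last element */

	return 0;

out_free:
	/* Decrement before the first iteration so index i itself is skipped. */
	while (--i >= 0) {
		free(toy_records[i]);
		toy_records[i] = NULL;
	}

	return -1;
}
```

When the failure happens after the loop, `i` equals `TOY_NR`, so an unwind that starts at index `i` itself would touch `toy_records[TOY_NR]`, one element beyond the end of the array.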
|
| 7d19eea5 | 01-Mar-2024 |
Yazen Ghannam <[email protected]> |
RAS/AMD/FMPM: Add debugfs interface to print record entries
It is helpful to see the saved record entries during run time in human-readable format. This is useful for testing during module development. It can also be used by system admins to quickly and easily see the state of the system.
Provide a sequential file in debugfs to print fields of interest from the FRU records and their entries.
Don't fail to load the module if the debugfs interface is not available. This is a convenience feature which does not affect other module functionality.
The new interface reads the record entries and should hold the mutex. Expand the mutex code comment to clarify when it should be held.
Signed-off-by: Yazen Ghannam <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
| 838850c5 | 01-Mar-2024 |
Yazen Ghannam <[email protected]> |
RAS/AMD/FMPM: Save SPA values
The system physical address (SPA) of an error is not a stable value. It will change depending on the location of the memory: parts can be swapped. And it will change depending on memory topology: NUMA nodes and/or interleaving can be adjusted.
Therefore, the SPA value is not part of the "FRU Memory Poison" record format. And it will not be saved to persistent storage.
However, the SPA values can be helpful during debug and for system admins during run time.
Save the SPA values in a separate structure. This is updated when records are restored and when new errors are saved.
[ bp: Make error messages more user friendly and add and correct comments. ]
Signed-off-by: Yazen Ghannam <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
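Illustration (not part of the commit): one way to keep SPA values alongside, but outside, the persistent records is a flat array indexed by FRU and entry. The structure, the `toy_` helpers, and the all-ones invalid marker are assumptions for this sketch.

```c
#include <stdint.h>
#include <stdlib.h>

#define TOY_INVALID_SPA UINT64_MAX	/* assumed marker for "no SPA known" */

/* Run-time-only cache; nothing here is written to persistent storage. */
struct toy_spa_cache {
	uint64_t *spa;
	unsigned int nr_fru;
	unsigned int nr_entries;	/* entries per FRU */
};

static int toy_spa_init(struct toy_spa_cache *c, unsigned int nr_fru,
			unsigned int nr_entries)
{
	size_t i, n = (size_t)nr_fru * nr_entries;

	c->spa = malloc(n * sizeof(*c->spa));
	if (!c->spa)
		return -1;

	for (i = 0; i < n; i++)
		c->spa[i] = TOY_INVALID_SPA;

	c->nr_fru = nr_fru;
	c->nr_entries = nr_entries;
	return 0;
}

/* Called both when a record is restored and when a new error is saved. */
static void toy_save_spa(struct toy_spa_cache *c, unsigned int fru,
			 unsigned int entry, uint64_t spa)
{
	if (fru >= c->nr_fru || entry >= c->nr_entries)
		return;

	c->spa[(size_t)fru * c->nr_entries + entry] = spa;
}

static uint64_t toy_get_spa(const struct toy_spa_cache *c, unsigned int fru,
			    unsigned int entry)
{
	if (fru >= c->nr_fru || entry >= c->nr_entries)
		return TOY_INVALID_SPA;

	return c->spa[(size_t)fru * c->nr_entries + entry];
}
```

Keeping the cache separate means the on-storage record format stays free of values that can change between boots.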
|
| dd61b55d | 22-Feb-2024 |
Yazen Ghannam <[email protected]> |
RAS/AMD/ATL: Fix bit overflow in denorm_addr_df4_np2()
The hash_pa8 and hashed_bit values in denorm_addr_df4_np2() are currently defined as u8 types. These variables represent single bits.
'hash_pa8' is set based on bitwise AND operations using masks with more than 8 bits. So the calculated value will not fit in this variable. It will always be '0'. The 'hash_pa8' check later in the function will fail, which produces incorrect results for some cases.
Change these variables to bool type. This clarifies that they are single bit values. Also, this allows the compiler to ensure they hold the proper results. Remove an unnecessary shift operation.
[ bp: Remove the unnecessary brackets in the else-branch of the hash_pa8 assignment. ]
Fixes: 3f3174996be6 ("RAS: Introduce AMD Address Translation Library")
Signed-off-by: Yazen Ghannam <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
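Illustration (not part of the commit): a small userspace sketch of the truncation. Storing the result of masking bit 8 in an 8-bit variable discards the only bit that can be set, while a bool converts any nonzero value to true. The function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_BIT8 (1ULL << 8)

/* Narrow type: the masked value 0x100 does not fit in 8 bits and becomes 0. */
static uint8_t toy_bit8_u8(uint64_t addr)
{
	return (uint8_t)(addr & TOY_BIT8);
}

/* bool: any nonzero masked value converts to true, so no shift is needed. */
static bool toy_bit8_bool(uint64_t addr)
{
	return addr & TOY_BIT8;
}
```

The bool version also removes the need to shift the masked bit down to position 0 before storing it.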
|
| 6f15e617 | 14-Feb-2024 |
Yazen Ghannam <[email protected]> |
RAS: Introduce a FRU memory poison manager
Memory errors are an expected occurrence on systems with high memory density. Generally, errors within a small number of unique physical locations are acceptable, based on manufacturer and/or admin policy. During run time, memory with errors may be retired so it is no longer used by the system. This is done in mm through page poisoning, and the effect will remain until the system is restarted.
If a memory location is consistently faulty, then the same run time error handling may occur in the next reboot cycle, leading to terminating jobs due to that already known bad memory. This could be prevented if information from the previous boot was not lost.
Some add-in cards with driver-managed memory have on-board persistent storage. Their driver saves memory error information to the persistent storage during run time. The information is then restored after reset, and known bad memory will be retired before the hardware is used. A running log of bad memory locations is kept across multiple resets.
A similar solution is desirable for CPUs. However, this solution should leverage industry-standard components as much as possible, rather than a bespoke platform driver.
Two components are needed: a record format and a persistent storage interface.
Implement a new module to manage the record formats on persistent storage. Use the requirements for an AMD MI300-based system to start. Vendor- and platform-specific details can be abstracted later as needed.
[ bp: Massage commit message and code, squash 30-ish more fixes from Yazen and me. ]
Signed-off-by: Yazen Ghannam <[email protected]>
Co-developed-by: <[email protected]>
Signed-off-by: <[email protected]>
Co-developed-by: <[email protected]>
Signed-off-by: <[email protected]>
Tested-by: <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
| 87a61237 | 31-Jan-2024 |
Yazen Ghannam <[email protected]> |
RAS/AMD/ATL: Add MI300 DRAM to normalized address translation support
Zen-based AMD systems report DRAM ECC errors through Unified Memory Controller (UMC) MCA banks. The value provided in MCA_ADDR is a "normalized" address which represents the UMC's view of its managed memory. The normalized address must be translated to a system physical address for software to take action.
MI300 systems, uniquely, do not provide a normalized address in MCA_ADDR for DRAM ECC errors. Rather, the "DRAM" address is reported. This value includes identifiers for the bank, row, column, pseudochannel and stack of the memory location.
The DRAM address must be converted to a normalized address in order to be further translated to a system physical address.
Add helper functions to do the DRAM to normalized translation for MI300 systems. The method is based on the fixed hardware layout of the on-chip memory.
[ bp: Massage commit message, decapitalize some, rename function. ]
Signed-off-by: Yazen Ghannam <[email protected]>
Co-developed-by: Muralidhara M K <[email protected]>
Signed-off-by: Muralidhara M K <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Tested-by: Muralidhara M K <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
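Illustration (not part of the commit): a sketch of packing DRAM address fields into a normalized address using a fixed layout. The field widths, bit positions, and `toy_` names are invented for this example and do not describe the real MI300 layout.

```c
#include <stdint.h>

/* The DRAM address identifiers named in the text; widths are hypothetical. */
struct toy_dram_addr {
	uint8_t stack;	/* 1 bit */
	uint8_t pc;	/* pseudochannel, 1 bit */
	uint8_t bank;	/* 4 bits */
	uint16_t row;	/* 14 bits */
	uint8_t col;	/* 5 bits */
};

/* Hypothetical fixed layout, low to high: col, pc, bank, stack, row. */
#define TOY_COL_SHIFT	0
#define TOY_PC_SHIFT	5
#define TOY_BANK_SHIFT	6
#define TOY_STACK_SHIFT	10
#define TOY_ROW_SHIFT	11

/* Place each field at its fixed position in the normalized address. */
static uint64_t toy_dram_to_norm(const struct toy_dram_addr *d)
{
	uint64_t norm = 0;

	norm |= (uint64_t)(d->col & 0x1f) << TOY_COL_SHIFT;
	norm |= (uint64_t)(d->pc & 0x1) << TOY_PC_SHIFT;
	norm |= (uint64_t)(d->bank & 0xf) << TOY_BANK_SHIFT;
	norm |= (uint64_t)(d->stack & 0x1) << TOY_STACK_SHIFT;
	norm |= (uint64_t)(d->row & 0x3fff) << TOY_ROW_SHIFT;

	return norm;
}
```

The result would then go through the usual normalized-to-system-physical translation.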
|