Revision tags: llvmorg-20.1.0, llvmorg-20.1.0-rc3, llvmorg-20.1.0-rc2, llvmorg-20.1.0-rc1, llvmorg-21-init, llvmorg-19.1.7, llvmorg-19.1.6, llvmorg-19.1.5, llvmorg-19.1.4, llvmorg-19.1.3, llvmorg-19.1.2, llvmorg-19.1.1, llvmorg-19.1.0, llvmorg-19.1.0-rc4, llvmorg-19.1.0-rc3, llvmorg-19.1.0-rc2, llvmorg-19.1.0-rc1, llvmorg-20-init, llvmorg-18.1.8, llvmorg-18.1.7, llvmorg-18.1.6, llvmorg-18.1.5, llvmorg-18.1.4, llvmorg-18.1.3, llvmorg-18.1.2, llvmorg-18.1.1, llvmorg-18.1.0, llvmorg-18.1.0-rc4, llvmorg-18.1.0-rc3, llvmorg-18.1.0-rc2, llvmorg-18.1.0-rc1, llvmorg-19-init, llvmorg-17.0.6, llvmorg-17.0.5, llvmorg-17.0.4, llvmorg-17.0.3, llvmorg-17.0.2, llvmorg-17.0.1, llvmorg-17.0.0, llvmorg-17.0.0-rc4, llvmorg-17.0.0-rc3, llvmorg-17.0.0-rc2, llvmorg-17.0.0-rc1, llvmorg-18-init, llvmorg-16.0.6, llvmorg-16.0.5, llvmorg-16.0.4, llvmorg-16.0.3, llvmorg-16.0.2, llvmorg-16.0.1, llvmorg-16.0.0, llvmorg-16.0.0-rc4, llvmorg-16.0.0-rc3, llvmorg-16.0.0-rc2, llvmorg-16.0.0-rc1, llvmorg-17-init, llvmorg-15.0.7, llvmorg-15.0.6, llvmorg-15.0.5, llvmorg-15.0.4, llvmorg-15.0.3, llvmorg-15.0.2, llvmorg-15.0.1, llvmorg-15.0.0, llvmorg-15.0.0-rc3, llvmorg-15.0.0-rc2, llvmorg-15.0.0-rc1, llvmorg-16-init, llvmorg-14.0.6, llvmorg-14.0.5, llvmorg-14.0.4, llvmorg-14.0.3, llvmorg-14.0.2, llvmorg-14.0.1, llvmorg-14.0.0, llvmorg-14.0.0-rc4, llvmorg-14.0.0-rc3, llvmorg-14.0.0-rc2, llvmorg-14.0.0-rc1, llvmorg-15-init, llvmorg-13.0.1, llvmorg-13.0.1-rc3, llvmorg-13.0.1-rc2 |
# 8470bf2b | 12-Jan-2022 | Austin Kerbow <[email protected]>
[AMDGPU] Do not reserve any VGPR for SGPR spills
After the split register allocation changes in eebe841a47cb it is no longer necessary to reserve a VGPR before RA. This can also create bugs when IPRA is enabled, since we cannot predict whether a called function reserves any register at all: a callee without SGPR spills may not. If that happens, those functions may override reserved registers that are normally callee saved. Added a test to show this.
Fixes: SWDEV-309900
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D115551
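As an illustration of the IPRA hazard, here is a hypothetical sketch (invented names, not the AMDGPU code) of an interprocedural summary of a callee's clobbered registers; a VGPR that is reserved only when SGPR spills exist never shows up in such a summary, so callers cannot rely on it.

    // Hypothetical sketch: IPRA-style clobber summary. All names invented.
    #include "llvm/ADT/BitVector.h"
    #include "llvm/CodeGen/MachineFunction.h"
    #include "llvm/CodeGen/TargetRegisterInfo.h"

    struct CalleeSummary { llvm::BitVector Clobbered; };

    CalleeSummary summarize(const llvm::MachineFunction &Callee,
                            const llvm::TargetRegisterInfo &TRI) {
      CalleeSummary S{llvm::BitVector(TRI.getNumRegs())};
      for (const llvm::MachineBasicBlock &MBB : Callee)
        for (const llvm::MachineInstr &MI : MBB)
          for (const llvm::MachineOperand &MO : MI.operands())
            if (MO.isReg() && MO.isDef() && MO.getReg().isPhysical())
              S.Clobbered.set(MO.getReg());
      // A spill VGPR reserved only when SGPR spills exist is invisible
      // here, so a caller cannot treat it as callee saved.
      return S;
    }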
Revision tags: llvmorg-13.0.1-rc1, llvmorg-13.0.0, llvmorg-13.0.0-rc4, llvmorg-13.0.0-rc3, llvmorg-13.0.0-rc2 |
# 4a36e96c | 21-Aug-2021 | Matt Arsenault <[email protected]>
RegAllocGreedy: Account for reserved registers in num regs heuristic
This simple heuristic uses the estimated live range length combined with the number of registers in the class to switch which heuristic to use. This was taking the raw number of registers in the class, even though not all of them may be available. AMDGPU heavily relies on dynamically reserved numbers of registers based on user attributes to satisfy occupancy constraints, so the raw number is highly misleading.
There are still a few problems here. In the original testcase that made me notice this, the live range size is incorrect after the scheduler rearranges instructions, since the instructions don't have the original InstrDist offsets. Additionally, I think it would be more appropriate to use the number of disjointly allocatable registers in the class. For the AMDGPU register tuples, there are a large number of registers in each tuple class, but only a small fraction can actually be allocated at the same time since they all overlap with each other. It seems we do not have a query that corresponds to the number of independently allocatable registers. Relatedly, I'm still debugging some allocation failures where overlapping tuples seem to not be handled correctly.
The test changes are mostly noise. There are a handful of x86 tests that look like regressions with an additional spill, and a handful that now avoid a spill. The worst looking regression is likely test/Thumb2/mve-vld4.ll which introduces a few additional spills. test/CodeGen/AMDGPU/soft-clause-exceeds-register-budget.ll shows a massive improvement by completely eliminating a large number of spills inside a loop.
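A minimal sketch of the changed check, assuming the RegisterClassInfo interface; the surrounding RegAllocGreedy priority logic is simplified away.

    #include "llvm/CodeGen/RegisterClassInfo.h"
    #include "llvm/CodeGen/SlotIndexes.h"

    // Treat a live range as "long" relative to the registers that can
    // actually be assigned.  getNumAllocatableRegs excludes reserved
    // registers, unlike the raw TargetRegisterClass::getNumRegs count.
    static bool isLongLiveRange(unsigned SizeInSlots,
                                const llvm::TargetRegisterClass &RC,
                                const llvm::RegisterClassInfo &RCI) {
      return (SizeInSlots / llvm::SlotIndex::InstrDist) >
             2 * RCI.getNumAllocatableRegs(&RC);
    }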
# fbae3463 | 17-Aug-2021 | Sebastian Neubauer <[email protected]>
[GlobalISel] Add combine for PTR_ADD with regbanks
Combine two G_PTR_ADDs, but keep the register bank of the constant. That way, the combine can be used in post-regbank-select combines.
Introduce two helper methods in CombinerHelper, getRegBank and setRegBank, that get and set an optional register bank on a register. That way, they can be used before and after register bank selection.
Differential Revision: https://reviews.llvm.org/D103326
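A minimal sketch of the combine, assuming MachineIRBuilder/MachineRegisterInfo APIs; the checks that the inner instruction really is a G_PTR_ADD fed by constants (and the one-use checks) are omitted, and the shape is simplified from the real CombinerHelper code.

    #include "llvm/CodeGen/GlobalISel/MachineIRBuilder.h"

    // Reassociate (%base + C1) + C2 into %base + (C1 + C2), keeping the
    // register bank of the original constant so the combine stays sound
    // after regbankselect.
    void combinePtrAddChain(llvm::MachineInstr &MI, llvm::MachineIRBuilder &B,
                            llvm::MachineRegisterInfo &MRI) {
      using namespace llvm;
      MachineInstr *Inner = MRI.getVRegDef(MI.getOperand(1).getReg());
      Register Base = Inner->getOperand(1).getReg();
      Register C1 = Inner->getOperand(2).getReg();
      Register C2 = MI.getOperand(2).getReg();

      const RegisterBank *Bank = MRI.getRegBankOrNull(C1);
      auto Sum = B.buildAdd(MRI.getType(C1), C1, C2);
      if (Bank)
        MRI.setRegBank(Sum.getReg(0), *Bank); // keep the constant's bank
      B.buildPtrAdd(MI.getOperand(0).getReg(), Base, Sum.getReg(0));
      MI.eraseFromParent();
    }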
Revision tags: llvmorg-13.0.0-rc1, llvmorg-14-init, llvmorg-12.0.1, llvmorg-12.0.1-rc4, llvmorg-12.0.1-rc3, llvmorg-12.0.1-rc2, llvmorg-12.0.1-rc1, llvmorg-12.0.0, llvmorg-12.0.0-rc5, llvmorg-12.0.0-rc4, llvmorg-12.0.0-rc3, llvmorg-12.0.0-rc2, llvmorg-11.1.0, llvmorg-11.1.0-rc3, llvmorg-12.0.0-rc1, llvmorg-13-init, llvmorg-11.1.0-rc2, llvmorg-11.1.0-rc1, llvmorg-11.0.1, llvmorg-11.0.1-rc2, llvmorg-11.0.1-rc1, llvmorg-11.0.0, llvmorg-11.0.0-rc6, llvmorg-11.0.0-rc5, llvmorg-11.0.0-rc4, llvmorg-11.0.0-rc3, llvmorg-11.0.0-rc2, llvmorg-11.0.0-rc1, llvmorg-12-init, llvmorg-10.0.1, llvmorg-10.0.1-rc4, llvmorg-10.0.1-rc3, llvmorg-10.0.1-rc2, llvmorg-10.0.1-rc1, llvmorg-10.0.0, llvmorg-10.0.0-rc6, llvmorg-10.0.0-rc5, llvmorg-10.0.0-rc4, llvmorg-10.0.0-rc3, llvmorg-10.0.0-rc2, llvmorg-10.0.0-rc1, llvmorg-11-init, llvmorg-9.0.1, llvmorg-9.0.1-rc3, llvmorg-9.0.1-rc2, llvmorg-9.0.1-rc1, llvmorg-9.0.0, llvmorg-9.0.0-rc6, llvmorg-9.0.0-rc5, llvmorg-9.0.0-rc4, llvmorg-9.0.0-rc3, llvmorg-9.0.0-rc2, llvmorg-9.0.0-rc1, llvmorg-10-init, llvmorg-8.0.1, llvmorg-8.0.1-rc4, llvmorg-8.0.1-rc3, llvmorg-8.0.1-rc2, llvmorg-8.0.1-rc1, llvmorg-8.0.0, llvmorg-8.0.0-rc5, llvmorg-8.0.0-rc4, llvmorg-8.0.0-rc3, llvmorg-7.1.0, llvmorg-7.1.0-rc1, llvmorg-8.0.0-rc2, llvmorg-8.0.0-rc1, llvmorg-7.0.1, llvmorg-7.0.1-rc3, llvmorg-7.0.1-rc2, llvmorg-7.0.1-rc1 |
# eebe841a | 26-Sep-2018 | Matt Arsenault <[email protected]>
RegAlloc: Allow targets to split register allocation
AMDGPU normally spills SGPRs to VGPRs. Previously, since all register classes are handled at the same time, this was problematic. We don't know ahead of time how many registers will be needed to be reserved to handle the spilling. If no VGPRs were left for spilling, we would have to try to spill to memory. If the spilled SGPRs were required for exec mask manipulation, it is highly problematic because the lanes active at the point of spill are not necessarily the same as at the restore point.
Avoid this problem by fully allocating SGPRs in a separate regalloc run from VGPRs. This way we know the exact number of VGPRs needed, and can reserve them for a second run. This fixes the most serious issues, but it is still possible using inline asm to make all VGPRs unavailable. Start erroring in the case where we ever would require memory for an SGPR spill.
This is implemented by giving each regalloc pass a callback which reports if a register class should be handled or not. A few passes need some small changes to deal with leftover virtual registers.
In the AMDGPU implementation, a new pass is introduced to take the place of PrologEpilogInserter for SGPR spills emitted during the first run.
One disadvantage of this is that StackSlotColoring is currently no longer used for SGPR spills. It would need to be run again, which will require more work.
Error if the standard -regalloc option is used. Introduce new separate -sgpr-regalloc and -vgpr-regalloc flags, so the two runs can be controlled individually. PBQP is not currently supported, so this also prevents using the unhandled allocator.
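A short sketch of the callback mechanism, assuming the filter-taking createGreedyRegisterAllocator overload; the SGPR predicate is paraphrased rather than quoted, with SIRegisterInfo::isSGPRClass as the AMDGPU-internal hook.

    #include "llvm/CodeGen/Passes.h"
    // SIRegisterInfo.h is an AMDGPU-internal header.

    // Each allocator run only handles the virtual registers its filter
    // accepts; everything else stays virtual for a later run.
    static bool onlyAllocateSGPRs(const llvm::TargetRegisterInfo &TRI,
                                  const llvm::MachineRegisterInfo &MRI,
                                  llvm::Register Reg) {
      const llvm::TargetRegisterClass *RC = MRI.getRegClassOrNull(Reg);
      return RC && llvm::SIRegisterInfo::isSGPRClass(RC);
    }

    llvm::FunctionPass *createSGPRAllocPass() {
      return llvm::createGreedyRegisterAllocator(onlyAllocateSGPRs);
    }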
# f9f5d415 | 30-Apr-2021 | Brendon Cahoon <[email protected]>
[AMDGPU][GlobalISel] Legalize and select G_SBFX and G_UBFX
Adds legalizer, register bank select, and instruction select support for G_SBFX and G_UBFX. These opcodes generate scalar or vector ALU bitfield extract instructions for AMDGPU. The instructions accept either constant or register values for the offset and width operands.
The 32-bit scalar version is expanded to a sequence that combines the offset and width into a single register.
There are no 64-bit vgpr bitfield extract instructions, so the operations are expanded to a sequence of instructions that implement the operation. If the width is a constant, then the 32-bit bitfield extract instructions are used.
Moved the AArch64 specific code for creating G_SBFX to CombinerHelper.cpp so that it can be used by other targets. Only bitfield extracts with constant offset and width values are handled currently.
Differential Revision: https://reviews.llvm.org/D100149
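A hedged sketch of forming G_UBFX from a shift pair, in the spirit of the CombinerHelper change mentioned above; constant operands only, with legality and one-use checks omitted.

    #include "llvm/CodeGen/GlobalISel/MachineIRBuilder.h"

    //   %dst = G_LSHR (G_SHL %src, C1), C2   with C2 >= C1
    //   ==>  %dst = G_UBFX %src, C2 - C1, BitWidth - C2
    void applyUbfx(llvm::MachineInstr &MI, llvm::Register Src, uint64_t C1,
                   uint64_t C2, llvm::MachineIRBuilder &B,
                   llvm::MachineRegisterInfo &MRI) {
      using namespace llvm;
      LLT Ty = MRI.getType(MI.getOperand(0).getReg());
      auto Lsb = B.buildConstant(Ty, C2 - C1);
      auto Width = B.buildConstant(Ty, Ty.getScalarSizeInBits() - C2);
      B.buildInstr(TargetOpcode::G_UBFX, {MI.getOperand(0).getReg()},
                   {Src, Lsb, Width});
      MI.eraseFromParent();
    }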
# 96e1fcb1 | 07-Jun-2021 | Sebastian Neubauer <[email protected]>
[AMDGPU] Use s_add_i32 for address additions
This allows the add instruction to be converted to s_addk_i32, or to v_add_nc_u32 instead of needing v_add_co_u32 when it has to be converted to a VALU instruction.
Differential Revision: https://reviews.llvm.org/D103322
# 70cb57d7 | 04-Mar-2021 | Matt Arsenault <[email protected]>
AMDGPU/GlobalISel: Improve private addressing mode matching
This enables looking through copies to hack around constants not being regbank-selected correctly to match the use bank.
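A minimal sketch of the look-through-copy, assuming GlobalISel's getDefIgnoringCopies utility; matchConstantAddend is an invented name for illustration.

    #include "llvm/CodeGen/GlobalISel/Utils.h"
    #include <optional>

    // A constant may have been assigned the wrong bank and then copied;
    // walk through the copies to reach the original G_CONSTANT.
    std::optional<int64_t> matchConstantAddend(llvm::Register Reg,
                                               llvm::MachineRegisterInfo &MRI) {
      llvm::MachineInstr *Def = llvm::getDefIgnoringCopies(Reg, MRI);
      if (Def && Def->getOpcode() == llvm::TargetOpcode::G_CONSTANT)
        return Def->getOperand(1).getCImm()->getSExtValue();
      return std::nullopt;
    }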
# 8b898b19 | 02-Feb-2021 | Sebastian Neubauer <[email protected]>
[AMDGPU] Remove unused tmp register
The temporary register is only used to compute the frame pointer. The frame pointer is overwritten and not used in between, so we can reuse the frame pointer for the computation, saving one register.
Differential Revision: https://reviews.llvm.org/D95865
# 2291bd13 | 30-Nov-2020 | Austin Kerbow <[email protected]>
[AMDGPU] Update subtarget features for new target ID support
Support for XNACK and SRAMECC is not static on some GPUs. We must be able to differentiate between different scenarios for these dynamic subtarget features.
The possible settings are:
- Unsupported: The GPU has no support for XNACK/SRAMECC.
- Any: Preference is unspecified. Use conservative settings that can run anywhere.
- Off: Request support for XNACK/SRAMECC Off.
- On: Request support for XNACK/SRAMECC On.
GCNSubtarget will track the four options based on the following criteria. If the subtarget does not support XNACK/SRAMECC, we say the setting is "Unsupported". If no subtarget features for XNACK/SRAMECC are requested, we must support "Any" mode. If the subtarget features XNACK/SRAMECC exist in the feature string when initializing the subtarget, the settings are "On/Off".
The defaults are updated to be conservatively correct, meaning if no setting for XNACK or SRAMECC is explicitly requested, defaults will be used which generate code that can be run anywhere. This corresponds to the "Any" setting.
Differential Revision: https://reviews.llvm.org/D85882
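A small sketch of the four-state tracking described above; the enum mirrors the description, but names and placement are approximate rather than quoted from the AMDGPU sources.

    #include <optional>

    enum class TargetIDSetting { Unsupported, Any, Off, On };

    static TargetIDSetting getXnackSetting(bool SupportedOnGPU,
                                           std::optional<bool> Requested) {
      if (!SupportedOnGPU)
        return TargetIDSetting::Unsupported; // GPU has no XNACK at all
      if (!Requested)
        return TargetIDSetting::Any;         // nothing in the feature string
      return *Requested ? TargetIDSetting::On : TargetIDSetting::Off;
    }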
# 221fdedc | 22-Dec-2020 | Sebastian Neubauer <[email protected]>
[AMDGPU][GlobalISel] Fold flat vgpr + constant addresses
Use getPtrBaseWithConstantOffset in selectFlatOffsetImpl to fold more vgpr+constant addresses.
Differential Revision: https://reviews.llvm.org/D93692
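A minimal sketch of peeling a constant offset off a flat address, using generic MachineRegisterInfo APIs; the real selector goes through getPtrBaseWithConstantOffset and additionally validates the offset against the addressing mode.

    // Peel %addr = G_PTR_ADD %base, C into (%base, C) so C can be folded
    // into the FLAT instruction's immediate offset field.
    std::pair<llvm::Register, int64_t>
    splitFlatAddress(llvm::Register Addr, llvm::MachineRegisterInfo &MRI) {
      using namespace llvm;
      MachineInstr *Def = MRI.getVRegDef(Addr);
      if (Def && Def->getOpcode() == TargetOpcode::G_PTR_ADD) {
        MachineInstr *Off = MRI.getVRegDef(Def->getOperand(2).getReg());
        if (Off && Off->getOpcode() == TargetOpcode::G_CONSTANT)
          return {Def->getOperand(1).getReg(),
                  Off->getOperand(1).getCImm()->getSExtValue()};
      }
      return {Addr, 0}; // no foldable constant part
    }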
# 663f1668 | 14-Oct-2020 | Matt Arsenault <[email protected]>
AMDGPU: Fix verifier error on killed spill of partially undef register
This does unfortunately end up with extra waitcnts getting inserted that were avoided before. Ideally we would avoid the spills of these undef components in the first place.
# a343b9b0 | 23-Sep-2020 | Sebastian Neubauer <[email protected]>
Revert "[AMDGPU] Insert waitcnt after returning from call"
This reverts commit ca907bfb57d8ad3ec3bcc2cff2abab7b1b933af6.
According to michel.daenzer, > This completely broke the Mesa radeonsi drive
Revert "[AMDGPU] Insert waitcnt after returning from call"
This reverts commit ca907bfb57d8ad3ec3bcc2cff2abab7b1b933af6.
According to michel.daenzer,
> This completely broke the Mesa radeonsi driver on Navi 14. Xorg +
> xterm come up with major corruption & psychedelic colours.
# ca907bfb | 04-Sep-2020 | Sebastian Neubauer <[email protected]>
[AMDGPU] Insert waitcnt after returning from call
When memory operations are outstanding on function calls, either the caller or the callee can insert a waitcnt to ensure that all reads are finished. Calls need some time to be executed, so if the callee inserts the waitcnt, filling the instruction buffer and waiting for memory will be interleaved, hiding some latency. This comes at the cost of having a waitcnt inside functions that may not be needed as no memory operations are outstanding.
For function calls, this is already implemented. The same principle applies to returns: if the caller inserts a waitcnt after the call, the callee does not have to wait, and the return and memory operation can run in parallel.
This commit implements waiting in the caller after returning from a function call.
Differential Revision: https://reviews.llvm.org/D87674
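A hedged sketch of the caller-side wait: an s_waitcnt is inserted right after the call instead of the callee waiting before its return. The real pass tracks precise counter state; this version conservatively waits for everything.

    #include "llvm/CodeGen/MachineInstrBuilder.h"
    // SIInstrInfo.h is an AMDGPU-internal header.

    void insertWaitAfterCall(llvm::MachineBasicBlock &MBB,
                             llvm::MachineBasicBlock::iterator Call,
                             const llvm::SIInstrInfo *TII) {
      auto After = std::next(Call);
      // Immediate 0 asks to wait until every outstanding counter is 0.
      llvm::BuildMI(MBB, After, Call->getDebugLoc(),
                    TII->get(llvm::AMDGPU::S_WAITCNT))
          .addImm(0);
    }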
# c799f873 | 03-Jun-2020 | Jay Foad <[email protected]>
[AMDGPU] Don't cluster stores
Clustering loads has caching benefits, but as far as I know there is no advantage to clustering stores on any AMDGPU subtargets.
The disadvantage is that it tends to increase register pressure and restricts scheduling freedom.
Differential Revision: https://reviews.llvm.org/D85530
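A sketch of the mutation registration this change implies, using the real createLoadClusterDAGMutation/createStoreClusterDAGMutation factories from MachineScheduler.h; the surrounding target hook is simplified.

    #include "llvm/CodeGen/MachineScheduler.h"

    void addClusterMutations(llvm::ScheduleDAGMILive &DAG) {
      // Keep clustering loads for the caching benefit...
      DAG.addMutation(llvm::createLoadClusterDAGMutation(DAG.TII, DAG.TRI));
      // ...but stop clustering stores: it raises register pressure and
      // restricts scheduling freedom with no known benefit on AMDGPU.
      // DAG.addMutation(llvm::createStoreClusterDAGMutation(DAG.TII, DAG.TRI));
    }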
# 3359ea62 | 07-Aug-2020 | QingShan Zhang <[email protected]>
[Scheduling] Create the missing dependency edges for store cluster
If it is a load cluster, we don't need to create the dependency edge (SUb->SUa) from SUb to SUa, as they both depend on the base register "reg":
                +-------+
          +---->  reg   |
          |     +---+---+
          |         ^
          |         |
          |         |
          |         |
          |     +---+---+
          |     |  SUa  |  Load 0(reg)
          |     +---+---+
          |         ^
          |         |
          |         |
          |     +---+---+
          +----+  SUb  |  Load 4(reg)
                +-------+
But if it is a store cluster, we need to create the edge as shown below, so that an instruction the clustered store depends on (here, the definition of y) cannot be scheduled in between SUb and SUa:
                +-------+
          +---->  reg   |
          |     +---+---+
          |         ^
          |         |  Missing              +-------+
          |         | +-------------------->+   y   |
          |         | |                     +---+-+-+
          |     +---+---+                       ^
          |     |  SUa  |  Store x 0(reg)       |
          |     +---+---+                       |
          |         ^                           |
          |         |   +-----------------------+
          |         |   |
          |     +---+--++
          +----+  SUb   |  Store y 4(reg)
                +-------+
Reviewed By: evandro, arsenm, rampitec, foad, fhahn
Differential Revision: https://reviews.llvm.org/D72031
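A hedged sketch of pinning the store order with an artificial edge, using ScheduleDAGInstrs::addEdge; the real BaseMemOpClusterMutation decides from the surrounding DAG which edges are actually missing.

    #include "llvm/CodeGen/ScheduleDAGInstrs.h"

    // Make SUa an explicit predecessor of SUb so that nothing the later
    // store depends on can be scheduled between the clustered stores.
    void pinStoreCluster(llvm::ScheduleDAGInstrs &DAG,
                         llvm::SUnit &SUa, llvm::SUnit &SUb) {
      DAG.addEdge(&SUb, llvm::SDep(&SUa, llvm::SDep::Artificial));
    }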
# e0020153 | 27-Jul-2020 | Matt Arsenault <[email protected]>
GlobalISel: Implement fewerElementsVector for G_EXTRACT_VECTOR_ELT
Use the same basic strategy as LegalizeVectorTypes. Try to index into smaller pieces if there's a constant index, and otherwise fall back to a stack temporary.
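A sketch under stated assumptions (current GlobalISel builder and constant-lookup utilities, constant index only); the variable-index fallback to a stack temporary is only noted in a comment.

    #include "llvm/CodeGen/GlobalISel/MachineIRBuilder.h"
    #include "llvm/CodeGen/GlobalISel/Utils.h"

    bool narrowExtractElt(llvm::MachineInstr &MI, llvm::MachineIRBuilder &B,
                          llvm::MachineRegisterInfo &MRI,
                          llvm::LLT NarrowVecTy) {
      using namespace llvm;
      Register Dst = MI.getOperand(0).getReg();
      Register Vec = MI.getOperand(1).getReg();
      auto Idx =
          getIConstantVRegValWithLookThrough(MI.getOperand(2).getReg(), MRI);
      if (!Idx)
        return false; // variable index: fall back to a stack temporary

      unsigned PieceElts = NarrowVecTy.getNumElements();
      uint64_t IdxVal = Idx->Value.getZExtValue();

      // Split the wide vector into legal pieces, then extract from the
      // piece that holds the requested lane.
      auto Unmerge = B.buildUnmerge(NarrowVecTy, Vec);
      Register Piece = Unmerge.getReg(IdxVal / PieceElts);
      auto NewIdx = B.buildConstant(LLT::scalar(32), IdxVal % PieceElts);
      B.buildExtractVectorElement(Dst, Piece, NewIdx);
      MI.eraseFromParent();
      return true;
    }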