|
Revision tags: llvmorg-20.1.0, llvmorg-20.1.0-rc3, llvmorg-20.1.0-rc2, llvmorg-20.1.0-rc1, llvmorg-21-init, llvmorg-19.1.7, llvmorg-19.1.6, llvmorg-19.1.5, llvmorg-19.1.4, llvmorg-19.1.3, llvmorg-19.1.2, llvmorg-19.1.1, llvmorg-19.1.0, llvmorg-19.1.0-rc4, llvmorg-19.1.0-rc3, llvmorg-19.1.0-rc2, llvmorg-19.1.0-rc1, llvmorg-20-init, llvmorg-18.1.8, llvmorg-18.1.7, llvmorg-18.1.6, llvmorg-18.1.5, llvmorg-18.1.4, llvmorg-18.1.3, llvmorg-18.1.2, llvmorg-18.1.1, llvmorg-18.1.0, llvmorg-18.1.0-rc4, llvmorg-18.1.0-rc3, llvmorg-18.1.0-rc2, llvmorg-18.1.0-rc1, llvmorg-19-init, llvmorg-17.0.6, llvmorg-17.0.5, llvmorg-17.0.4, llvmorg-17.0.3, llvmorg-17.0.2, llvmorg-17.0.1, llvmorg-17.0.0, llvmorg-17.0.0-rc4, llvmorg-17.0.0-rc3, llvmorg-17.0.0-rc2, llvmorg-17.0.0-rc1, llvmorg-18-init, llvmorg-16.0.6, llvmorg-16.0.5, llvmorg-16.0.4, llvmorg-16.0.3, llvmorg-16.0.2, llvmorg-16.0.1, llvmorg-16.0.0, llvmorg-16.0.0-rc4, llvmorg-16.0.0-rc3, llvmorg-16.0.0-rc2, llvmorg-16.0.0-rc1, llvmorg-17-init, llvmorg-15.0.7, llvmorg-15.0.6, llvmorg-15.0.5, llvmorg-15.0.4, llvmorg-15.0.3, llvmorg-15.0.2, llvmorg-15.0.1, llvmorg-15.0.0, llvmorg-15.0.0-rc3, llvmorg-15.0.0-rc2, llvmorg-15.0.0-rc1, llvmorg-16-init, llvmorg-14.0.6 |
|
| #
2e29b013 |
| 21-Jun-2022 |
Alexander Timofeev <[email protected]> |
[AMDGPU] Lowering VGPR to SGPR copies to v_readfirstlane_b32 if profitable.
Since the divergence-driven instruction selection has been enabled for AMDGPU, all the uniform instructions are expected
[AMDGPU] Lowering VGPR to SGPR copies to v_readfirstlane_b32 if profitable.
Since the divergence-driven instruction selection has been enabled for AMDGPU, all the uniform instructions are expected to be selected to SALU form, except those not having one. VGPR to SGPR copies appear in MIR to connect values producers and consumers. This change implements an algorithm that evolves a reasonable tradeoff between the profit achieved from keeping the uniform instructions in SALU form and overhead introduced by the data transfer between the VGPRs and SGPRs.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D128252
show more ...
|
|
Revision tags: llvmorg-14.0.5, llvmorg-14.0.4 |
|
| #
e2926501 |
| 16-May-2022 |
Jay Foad <[email protected]> |
[AMDGPU] Aggressively fold immediates in SIShrinkInstructions
Fold immediates regardless of how many uses they have. This is expected to increase overall code size, but decrease register usage.
Dif
[AMDGPU] Aggressively fold immediates in SIShrinkInstructions
Fold immediates regardless of how many uses they have. This is expected to increase overall code size, but decrease register usage.
Differential Revision: https://reviews.llvm.org/D114644
show more ...
|
| #
3eb2281b |
| 16-May-2022 |
Jay Foad <[email protected]> |
[AMDGPU] Aggressively fold immediates in SIFoldOperands
Previously SIFoldOperands::foldInstOperand would only fold a non-inlinable immediate into a single user, so as not to increase code size by ad
[AMDGPU] Aggressively fold immediates in SIFoldOperands
Previously SIFoldOperands::foldInstOperand would only fold a non-inlinable immediate into a single user, so as not to increase code size by adding the same 32-bit literal operand to many instructions.
This patch removes that restriction, so that a non-inlinable immediate will be folded into any number of users. The rationale is: - It reduces the number of registers used for holding constant values, which might increase occupancy. (On the other hand, many of these registers are SGPRs which no longer affect occupancy on GFX10+.) - It reduces ALU stalls between the instruction that loads a constant into a register, and the instruction that uses it. - The above benefits are expected to outweigh any increase in code size.
Differential Revision: https://reviews.llvm.org/D114643
show more ...
|
| #
1ecc3d86 |
| 14-May-2022 |
Simon Pilgrim <[email protected]> |
[DAG] Enable ISD::SHL SimplifyMultipleUseDemandedBits handling inside SimplifyDemandedBits
Pulled out of D77804 as its going to be easier to address the regressions individually.
This patch allows
[DAG] Enable ISD::SHL SimplifyMultipleUseDemandedBits handling inside SimplifyDemandedBits
Pulled out of D77804 as its going to be easier to address the regressions individually.
This patch allows SimplifyDemandedBits to call SimplifyMultipleUseDemandedBits in cases where the source operand has other uses, enabling us to peek through the shifted value if we don't demand all the bits/elts.
The lost RISCV gorc2 fold shouldn't be a problem - instcombine would have already destroyed that pattern - see https://github.com/llvm/llvm-project/issues/50553
Differential Revision: https://reviews.llvm.org/D124839
show more ...
|
|
Revision tags: llvmorg-14.0.3, llvmorg-14.0.2, llvmorg-14.0.1, llvmorg-14.0.0, llvmorg-14.0.0-rc4, llvmorg-14.0.0-rc3, llvmorg-14.0.0-rc2, llvmorg-14.0.0-rc1, llvmorg-15-init, llvmorg-13.0.1, llvmorg-13.0.1-rc3 |
|
| #
0776f6e0 |
| 13-Jan-2022 |
Benjamin Kramer <[email protected]> |
[LSV] Vectorize loads of vectors by turning it into a larger vector
Use shufflevector to do the subvector extracts. This allows a lot more load merging on AMDGPU and also on NVPTX when <2 x half> is
[LSV] Vectorize loads of vectors by turning it into a larger vector
Use shufflevector to do the subvector extracts. This allows a lot more load merging on AMDGPU and also on NVPTX when <2 x half> is involved.
Differential Revision: https://reviews.llvm.org/D117219
show more ...
|
|
Revision tags: llvmorg-13.0.1-rc2 |
|
| #
c0581f7d |
| 05-Jan-2022 |
David Salinas <[email protected]> |
Revert D109159 : Revert "[amdgpu] Enable selection of `s_cselect_b64`."
This reverts commit 640beb38e7710b939b3cfb3f4c54accc694b1d30.
That commit caused performance degradtion in Quicksilver test Q
Revert D109159 : Revert "[amdgpu] Enable selection of `s_cselect_b64`."
This reverts commit 640beb38e7710b939b3cfb3f4c54accc694b1d30.
That commit caused performance degradtion in Quicksilver test QS:sGPU and a functional test failure in (rocPRIM rocprim.device_segmented_radix_sort). Reverting until we have a better solution to s_cselect_b64 codegen cleanup
Change-Id: Ifc167b3c2dae7a65920676f22a97ba76485f3456
Reviewed By: kzhuravl
Differential Revision: https://reviews.llvm.org/D116686
Change-Id: I1abf49b74a7e2ba0e0205f747a4154a468b9d7f2
show more ...
|
| #
085f0783 |
| 05-Jan-2022 |
Nico Weber <[email protected]> |
Revert "Revert D109159 "[amdgpu] Enable selection of `s_cselect_b64`.""
This reverts commit 859ebca744e634dcc89a2294ffa41574f947bd62. The change contained many unrelated changes and e.g. restored un
Revert "Revert D109159 "[amdgpu] Enable selection of `s_cselect_b64`.""
This reverts commit 859ebca744e634dcc89a2294ffa41574f947bd62. The change contained many unrelated changes and e.g. restored unit test failes for the old lld port.
show more ...
|
| #
859ebca7 |
| 23-Dec-2021 |
David Salinas <[email protected]> |
Revert D109159 "[amdgpu] Enable selection of `s_cselect_b64`."
This reverts commit 640beb38e7710b939b3cfb3f4c54accc694b1d30.
That commit caused performance degradtion in Quicksilver test QS:sGPU an
Revert D109159 "[amdgpu] Enable selection of `s_cselect_b64`."
This reverts commit 640beb38e7710b939b3cfb3f4c54accc694b1d30.
That commit caused performance degradtion in Quicksilver test QS:sGPU and a functional test failure in (rocPRIM rocprim.device_segmented_radix_sort). Reverting until we have a better solution to s_cselect_b64 codegen cleanup
Change-Id: Ibf8e397df94001f248fba609f072088a46abae08
Reviewed By: kzhuravl
Differential Revision: https://reviews.llvm.org/D115960
Change-Id: Id169459ce4dfffa857d5645a0af50b0063ce1105
show more ...
|
|
Revision tags: llvmorg-13.0.1-rc1 |
|
| #
da067ed5 |
| 10-Nov-2021 |
Austin Kerbow <[email protected]> |
[AMDGPU] Set most sched model resource's BufferSize to one
Using a BufferSize of one for memory ProcResources will result in better ILP since it more accurately models the dependencies between memor
[AMDGPU] Set most sched model resource's BufferSize to one
Using a BufferSize of one for memory ProcResources will result in better ILP since it more accurately models the dependencies between memory ops and their consumers on an in-order processor. After this change, the scheduler will treat the data edges from loads as blocking so that stalls are guaranteed when waiting for data to be retreaved from memory. Since we don't actually track waitcnt here, this should do a better job at modeling their behavior.
Practically, this means that the scheduler will trigger the 'STALL' heuristic more often.
This type of change needs to be evaluated experimentally. Preliminary results are positive.
Fixes: SWDEV-282962
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D114777
show more ...
|
| #
30b27ecf |
| 19-Nov-2021 |
Jay Foad <[email protected]> |
[AMDGPU] Use new opcode for indexed vgpr reads
Introduce V_MOV_B32_indirect_read for indexed vgpr reads (and rename the old V_MOV_B32_indirect to V_MOV_B32_indirect_write) so they can be unambiguous
[AMDGPU] Use new opcode for indexed vgpr reads
Introduce V_MOV_B32_indirect_read for indexed vgpr reads (and rename the old V_MOV_B32_indirect to V_MOV_B32_indirect_write) so they can be unambiguously distinguished from regular V_MOV_B32_e32. Previously they were distinguished by looking for extra implicit operands but this is fragile because regular moves sometimes have extra implicit operands too: - either by accident, when instructions end up with duplicate implicit operands (see e.g. D100939) - or by design, when SIInstrInfo::copyPhysReg breaks a multi-dword copy into individual subreg mov instructions and adds implicit operands for the super-register.
The effect of this is that SIInstrInfo::isFoldableCopy can be simplified and identifies more foldable copies. The test diffs show that more immediate 0 values have been folded as inline operands.
SIInstrInfo::isReallyTriviallyReMaterializable could probably be simplified too but that is not part of this patch.
Differential Revision: https://reviews.llvm.org/D114230
show more ...
|
| #
a70bbb5f |
| 11-Nov-2021 |
Jay Foad <[email protected]> |
[AMDGPU] Simplify 64-bit division/remainder expansion
The old expansion open-coded a 64-bit addition in a strange way, by adding the high parts *without* carry-in from the low part, and then adding
[AMDGPU] Simplify 64-bit division/remainder expansion
The old expansion open-coded a 64-bit addition in a strange way, by adding the high parts *without* carry-in from the low part, and then adding the carry back in later on. Fixing this saves a couple of instructions and makes the code much easier to understand.
Differential Revision: https://reviews.llvm.org/D113679
show more ...
|
| #
ca0c92d6 |
| 13-Oct-2021 |
Stanislav Mekhanoshin <[email protected]> |
[AMDGPU] Allow to use a whole register file on gfx90a for VGPRs
In a kernel which does not have calls or AGPR usage we can allocate the whole vector register budget for VGPRs and have no AGPRs as lo
[AMDGPU] Allow to use a whole register file on gfx90a for VGPRs
In a kernel which does not have calls or AGPR usage we can allocate the whole vector register budget for VGPRs and have no AGPRs as long as VGPRs stay addressable (i.e. below 256).
Differential Revision: https://reviews.llvm.org/D111764
show more ...
|
| #
c885857e |
| 12-Oct-2021 |
Jay Foad <[email protected]> |
[AMDGPU] Enable load clustering in the post-RA scheduler
This has a couple of benefits: 1. It can sometimes fix clusters that got broken apart when the register allocator inserted a copy. 2. Post
[AMDGPU] Enable load clustering in the post-RA scheduler
This has a couple of benefits: 1. It can sometimes fix clusters that got broken apart when the register allocator inserted a copy. 2. Post-RA scheduling does not have to worry about increasing register pressure, which in some cases gives it more freedom to reorder instructions.
Testing on a collection of 10,000 graphics shaders compiled for gfx1010 showed: - The average length of each run of one or more load instructions increased by about 1%. - The number of runs of two or more load instructions increased by about 4%.
Differential Revision: https://reviews.llvm.org/D111646
show more ...
|
| #
66ce1015 |
| 12-Oct-2021 |
Jay Foad <[email protected]> |
Revert "[AMDGPU] Enable load clustering in the post-RA scheduler"
This reverts commit 66e13c7f439cf162d7ed1d25883e71a5755ac7ec.
It was committed by accident.
|
| #
66e13c7f |
| 12-Oct-2021 |
Jay Foad <[email protected]> |
[AMDGPU] Enable load clustering in the post-RA scheduler
This has a couple of benefits: 1. It can sometimes fix clusters that got broken apart when the register allocator inserted a copy. 2. Post
[AMDGPU] Enable load clustering in the post-RA scheduler
This has a couple of benefits: 1. It can sometimes fix clusters that got broken apart when the register allocator inserted a copy. 2. Post-RA scheduling does not have to worry about increasing register pressure, which in some cases gives it more freedom to reorder instructions.
Testing on a collection of 10,000 graphics shaders compiled for gfx1010 showed: - The average length of each run of one or more load instructions increased by about 1%. - The number of runs of two or more load instructions increased by about 4%.
show more ...
|
|
Revision tags: llvmorg-13.0.0, llvmorg-13.0.0-rc4 |
|
| #
1a332946 |
| 16-Sep-2021 |
alex-t <[email protected]> |
[AMDGPU] Filtering out the inactive lanes bits when lowering copy to SCC
Normally, given that the DA results are kept consistent over the selection DAG, uniform comparisons get selected to S_CMP_* b
[AMDGPU] Filtering out the inactive lanes bits when lowering copy to SCC
Normally, given that the DA results are kept consistent over the selection DAG, uniform comparisons get selected to S_CMP_* but divergent to V_CMP_*. Sometimes, for the sake of efficiency, SSA subgraphs may be converted to VALU to avoid repeatedly copying data back and forth. Hence we have to be able to sustain the correctness passing the i1 from VALU to SALU context and vice versa.
VALU operations only process the active lanes of the VGPR and ignore inactive ones. Active lanes correspond to 1 bit in the EXEC mask register. SALU represents i1 as just one bit but VALU as 64bits: 0/1 and 0/(0xffffffffffffffff & EXEC) respectively. SALU uses one-bit conditional flag SCC but VALU - VCC that is a pair of 32-bit SGPRs
To expose SCC to the VALU context we need to convert the one-bit boolean value to the appropriate 64bit. To return back to the SALU context we need to do the opposite.
To correctly convert 64bit VALU boolean to either 0 or 1 we need to filter out the bits corresponding to the inactive lanes.
Reviewed By: piotr
Differential Revision: https://reviews.llvm.org/D109900
show more ...
|
|
Revision tags: llvmorg-13.0.0-rc3, llvmorg-13.0.0-rc2 |
|
| #
4a36e96c |
| 21-Aug-2021 |
Matt Arsenault <[email protected]> |
RegAllocGreedy: Account for reserved registers in num regs heuristic
This simple heuristic uses the estimated live range length combined with the number of registers in the class to switch which heu
RegAllocGreedy: Account for reserved registers in num regs heuristic
This simple heuristic uses the estimated live range length combined with the number of registers in the class to switch which heuristic to use. This was taking the raw number of registers in the class, even though not all of them may be available. AMDGPU heavily relies on dynamically reserved numbers of registers based on user attributes to satisfy occupancy constraints, so the raw number is highly misleading.
There are still a few problems here. In the original testcase that made me notice this, the live range size is incorrect after the scheduler rearranges instructions, since the instructions don't have the original InstrDist offsets. Additionally, I think it would be more appropriate to use the number of disjointly allocatable registers in the class. For the AMDGPU register tuples, there are a large number of registers in each tuple class, but only a small fraction can actually be allocated at the same time since they all overlap with each other. It seems we do not have a query that corresponds to the number of independently allocatable registers. Relatedly, I'm still debugging some allocation failures where overlapping tuples seem to not be handled correctly.
The test changes are mostly noise. There are a handful of x86 tests that look like regressions with an additional spill, and a handful that now avoid a spill. The worst looking regression is likely test/Thumb2/mve-vld4.ll which introduces a few additional spills. test/CodeGen/AMDGPU/soft-clause-exceeds-register-budget.ll shows a massive improvement by completely eliminating a large number of spills inside a loop.
show more ...
|
| #
3ce1b963 |
| 08-Sep-2021 |
Joe Nash <[email protected]> |
[AMDGPU] Switch PostRA sched to MachineSched
Use GCNHazardRecognizer in postra sched. Updated tests for the new schedules.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D1095
[AMDGPU] Switch PostRA sched to MachineSched
Use GCNHazardRecognizer in postra sched. Updated tests for the new schedules.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D109536
Change-Id: Ia86ba2ae168f12fb34b4d8efdab491f84d936cde
show more ...
|
| #
640beb38 |
| 30-Aug-2021 |
Michael Liao <[email protected]> |
[amdgpu] Enable selection of `s_cselect_b64`.
Differential Revision: https://reviews.llvm.org/D109159
|
|
Revision tags: llvmorg-13.0.0-rc1, llvmorg-14-init |
|
| #
ed0f4415 |
| 15-Jul-2021 |
alex-t <[email protected]> |
[AMDGPU] Divergence-driven compare operations instruction selection
Description: This change enables the compare operations to be selected to SALU/VALU form dependent of the SDNode dive
[AMDGPU] Divergence-driven compare operations instruction selection
Description: This change enables the compare operations to be selected to SALU/VALU form dependent of the SDNode divergence flag.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D106079
show more ...
|
| #
d719f1c3 |
| 03-Aug-2021 |
Matt Arsenault <[email protected]> |
AMDGPU: Add alloc priority to global ranges
The requested register class priorities weren't respected globally. Not sure why this is a target option, and not just the expected behavior (recently add
AMDGPU: Add alloc priority to global ranges
The requested register class priorities weren't respected globally. Not sure why this is a target option, and not just the expected behavior (recently added in 1a6dc92be7d68611077f0fb0b723b361817c950c). This avoids an allocation failure when many wide tuple spills are introduced. I think this is a workaround since I would not expect the allocation priority to be required, and only a performance hint. The allocator should be smarter about when only a subregister needs to be spilled and restored.
This does regress a couple of degenerate store stress lit tests which shouldn't be too important.
show more ...
|
| #
c2340517 |
| 04-Aug-2021 |
Craig Topper <[email protected]> |
[DAGCombiner][AMDGPU] Canonicalize constants to the RHS of MULHU/MULHS.
This allows special constants like to 0 to be recognized. It's also expected by isel patterns if a target had a mulh with imme
[DAGCombiner][AMDGPU] Canonicalize constants to the RHS of MULHU/MULHS.
This allows special constants like to 0 to be recognized. It's also expected by isel patterns if a target had a mulh with immediate instructions. The commuting done by tablegen won't commute patterns with immediates since it expects DAGCombine to have done it.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D107486
show more ...
|
| #
381ded34 |
| 28-Jun-2021 |
Stanislav Mekhanoshin <[email protected]> |
[AMDGPU] Add S_MOV_B64_IMM_PSEUDO for wide constants
This is to allow 64 bit constant rematerialization. If a constant is split into two separate moves initializing sub0 and sub1 like now RA cannot
[AMDGPU] Add S_MOV_B64_IMM_PSEUDO for wide constants
This is to allow 64 bit constant rematerialization. If a constant is split into two separate moves initializing sub0 and sub1 like now RA cannot rematerizalize a 64 bit register.
This gives 10-20% uplift in a set of huge apps heavily using double precession math.
Fixes: SWDEV-292645
Differential Revision: https://reviews.llvm.org/D104874
show more ...
|
|
Revision tags: llvmorg-12.0.1, llvmorg-12.0.1-rc4, llvmorg-12.0.1-rc3 |
|
| #
1f9dcd2b |
| 18-Jun-2021 |
Jay Foad <[email protected]> |
[AMDGPU] Update generated checks. NFC.
|
|
Revision tags: llvmorg-12.0.1-rc2, llvmorg-12.0.1-rc1 |
|
| #
3f7b7e73 |
| 08-May-2021 |
Brendon Cahoon <[email protected]> |
[AMDGPU] Update SCC defs to VCC when uses are changed to VCC
The FixSGPRCopies pass converts instructions to VALU when removing illegal VGPR to SGPR copies. Instructions that use SCC are changed to
[AMDGPU] Update SCC defs to VCC when uses are changed to VCC
The FixSGPRCopies pass converts instructions to VALU when removing illegal VGPR to SGPR copies. Instructions that use SCC are changed to use VCC instead. When that happens, the pass must also change instructions that define SCC to define VCC.
The pass was not changing the SCC definition when an ADDC is converted due to a input that is a VGPR to SGPR copy. But, the initial ADD insruction, which define SCC, is not converted. This causes a compilation failure due to a use of an undefined physical register.
This patch adds code that inserts the SCC definition in the MoveToVALU worklist when a SCC use is converted to a VCC use.
Differential Revision: https://reviews.llvm.org/D102111
show more ...
|