History log of /llvm-project-15.0.7/llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll (Results 1 – 25 of 50)
Revision (<<< Hide revision tags) (Show revision tags >>>) Date Author Comments
Revision tags: llvmorg-20.1.0, llvmorg-20.1.0-rc3, llvmorg-20.1.0-rc2, llvmorg-20.1.0-rc1, llvmorg-21-init, llvmorg-19.1.7, llvmorg-19.1.6, llvmorg-19.1.5, llvmorg-19.1.4, llvmorg-19.1.3, llvmorg-19.1.2, llvmorg-19.1.1, llvmorg-19.1.0, llvmorg-19.1.0-rc4, llvmorg-19.1.0-rc3, llvmorg-19.1.0-rc2, llvmorg-19.1.0-rc1, llvmorg-20-init, llvmorg-18.1.8, llvmorg-18.1.7, llvmorg-18.1.6, llvmorg-18.1.5, llvmorg-18.1.4, llvmorg-18.1.3, llvmorg-18.1.2, llvmorg-18.1.1, llvmorg-18.1.0, llvmorg-18.1.0-rc4, llvmorg-18.1.0-rc3, llvmorg-18.1.0-rc2, llvmorg-18.1.0-rc1, llvmorg-19-init, llvmorg-17.0.6, llvmorg-17.0.5, llvmorg-17.0.4, llvmorg-17.0.3, llvmorg-17.0.2, llvmorg-17.0.1, llvmorg-17.0.0, llvmorg-17.0.0-rc4, llvmorg-17.0.0-rc3, llvmorg-17.0.0-rc2, llvmorg-17.0.0-rc1, llvmorg-18-init, llvmorg-16.0.6, llvmorg-16.0.5, llvmorg-16.0.4, llvmorg-16.0.3, llvmorg-16.0.2, llvmorg-16.0.1, llvmorg-16.0.0, llvmorg-16.0.0-rc4, llvmorg-16.0.0-rc3, llvmorg-16.0.0-rc2, llvmorg-16.0.0-rc1, llvmorg-17-init, llvmorg-15.0.7, llvmorg-15.0.6, llvmorg-15.0.5, llvmorg-15.0.4, llvmorg-15.0.3, llvmorg-15.0.2, llvmorg-15.0.1, llvmorg-15.0.0, llvmorg-15.0.0-rc3, llvmorg-15.0.0-rc2, llvmorg-15.0.0-rc1, llvmorg-16-init, llvmorg-14.0.6
# 2e29b013 21-Jun-2022 Alexander Timofeev <[email protected]>

[AMDGPU] Lowering VGPR to SGPR copies to v_readfirstlane_b32 if profitable.

Since the divergence-driven instruction selection has been enabled for AMDGPU,
all the uniform instructions are expected

[AMDGPU] Lowering VGPR to SGPR copies to v_readfirstlane_b32 if profitable.

Since the divergence-driven instruction selection has been enabled for AMDGPU,
all the uniform instructions are expected to be selected to SALU form, except those not having one.
VGPR to SGPR copies appear in MIR to connect values producers and consumers. This change implements an algorithm
that evolves a reasonable tradeoff between the profit achieved from keeping the uniform instructions in SALU form
and overhead introduced by the data transfer between the VGPRs and SGPRs.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D128252

show more ...


Revision tags: llvmorg-14.0.5, llvmorg-14.0.4
# e2926501 16-May-2022 Jay Foad <[email protected]>

[AMDGPU] Aggressively fold immediates in SIShrinkInstructions

Fold immediates regardless of how many uses they have. This is expected
to increase overall code size, but decrease register usage.

Dif

[AMDGPU] Aggressively fold immediates in SIShrinkInstructions

Fold immediates regardless of how many uses they have. This is expected
to increase overall code size, but decrease register usage.

Differential Revision: https://reviews.llvm.org/D114644

show more ...


# 3eb2281b 16-May-2022 Jay Foad <[email protected]>

[AMDGPU] Aggressively fold immediates in SIFoldOperands

Previously SIFoldOperands::foldInstOperand would only fold a
non-inlinable immediate into a single user, so as not to increase code
size by ad

[AMDGPU] Aggressively fold immediates in SIFoldOperands

Previously SIFoldOperands::foldInstOperand would only fold a
non-inlinable immediate into a single user, so as not to increase code
size by adding the same 32-bit literal operand to many instructions.

This patch removes that restriction, so that a non-inlinable immediate
will be folded into any number of users. The rationale is:
- It reduces the number of registers used for holding constant values,
which might increase occupancy. (On the other hand, many of these
registers are SGPRs which no longer affect occupancy on GFX10+.)
- It reduces ALU stalls between the instruction that loads a constant
into a register, and the instruction that uses it.
- The above benefits are expected to outweigh any increase in code size.

Differential Revision: https://reviews.llvm.org/D114643

show more ...


# 1ecc3d86 14-May-2022 Simon Pilgrim <[email protected]>

[DAG] Enable ISD::SHL SimplifyMultipleUseDemandedBits handling inside SimplifyDemandedBits

Pulled out of D77804 as its going to be easier to address the regressions individually.

This patch allows

[DAG] Enable ISD::SHL SimplifyMultipleUseDemandedBits handling inside SimplifyDemandedBits

Pulled out of D77804 as its going to be easier to address the regressions individually.

This patch allows SimplifyDemandedBits to call SimplifyMultipleUseDemandedBits in cases where the source operand has other uses, enabling us to peek through the shifted value if we don't demand all the bits/elts.

The lost RISCV gorc2 fold shouldn't be a problem - instcombine would have already destroyed that pattern - see https://github.com/llvm/llvm-project/issues/50553

Differential Revision: https://reviews.llvm.org/D124839

show more ...


Revision tags: llvmorg-14.0.3, llvmorg-14.0.2, llvmorg-14.0.1, llvmorg-14.0.0, llvmorg-14.0.0-rc4, llvmorg-14.0.0-rc3, llvmorg-14.0.0-rc2, llvmorg-14.0.0-rc1, llvmorg-15-init, llvmorg-13.0.1, llvmorg-13.0.1-rc3
# 0776f6e0 13-Jan-2022 Benjamin Kramer <[email protected]>

[LSV] Vectorize loads of vectors by turning it into a larger vector

Use shufflevector to do the subvector extracts. This allows a lot more
load merging on AMDGPU and also on NVPTX when <2 x half> is

[LSV] Vectorize loads of vectors by turning it into a larger vector

Use shufflevector to do the subvector extracts. This allows a lot more
load merging on AMDGPU and also on NVPTX when <2 x half> is involved.

Differential Revision: https://reviews.llvm.org/D117219

show more ...


Revision tags: llvmorg-13.0.1-rc2
# c0581f7d 05-Jan-2022 David Salinas <[email protected]>

Revert D109159 : Revert "[amdgpu] Enable selection of `s_cselect_b64`."

This reverts commit 640beb38e7710b939b3cfb3f4c54accc694b1d30.

That commit caused performance degradtion in Quicksilver test Q

Revert D109159 : Revert "[amdgpu] Enable selection of `s_cselect_b64`."

This reverts commit 640beb38e7710b939b3cfb3f4c54accc694b1d30.

That commit caused performance degradtion in Quicksilver test QS:sGPU and a functional test failure in (rocPRIM rocprim.device_segmented_radix_sort).
Reverting until we have a better solution to s_cselect_b64 codegen cleanup

Change-Id: Ifc167b3c2dae7a65920676f22a97ba76485f3456

Reviewed By: kzhuravl

Differential Revision: https://reviews.llvm.org/D116686

Change-Id: I1abf49b74a7e2ba0e0205f747a4154a468b9d7f2

show more ...


# 085f0783 05-Jan-2022 Nico Weber <[email protected]>

Revert "Revert D109159 "[amdgpu] Enable selection of `s_cselect_b64`.""

This reverts commit 859ebca744e634dcc89a2294ffa41574f947bd62.
The change contained many unrelated changes and e.g. restored
un

Revert "Revert D109159 "[amdgpu] Enable selection of `s_cselect_b64`.""

This reverts commit 859ebca744e634dcc89a2294ffa41574f947bd62.
The change contained many unrelated changes and e.g. restored
unit test failes for the old lld port.

show more ...


# 859ebca7 23-Dec-2021 David Salinas <[email protected]>

Revert D109159 "[amdgpu] Enable selection of `s_cselect_b64`."

This reverts commit 640beb38e7710b939b3cfb3f4c54accc694b1d30.

That commit caused performance degradtion in Quicksilver test QS:sGPU an

Revert D109159 "[amdgpu] Enable selection of `s_cselect_b64`."

This reverts commit 640beb38e7710b939b3cfb3f4c54accc694b1d30.

That commit caused performance degradtion in Quicksilver test QS:sGPU and a functional test failure in (rocPRIM rocprim.device_segmented_radix_sort).
Reverting until we have a better solution to s_cselect_b64 codegen cleanup

Change-Id: Ibf8e397df94001f248fba609f072088a46abae08

Reviewed By: kzhuravl

Differential Revision: https://reviews.llvm.org/D115960

Change-Id: Id169459ce4dfffa857d5645a0af50b0063ce1105

show more ...


Revision tags: llvmorg-13.0.1-rc1
# da067ed5 10-Nov-2021 Austin Kerbow <[email protected]>

[AMDGPU] Set most sched model resource's BufferSize to one

Using a BufferSize of one for memory ProcResources will result in better
ILP since it more accurately models the dependencies between memor

[AMDGPU] Set most sched model resource's BufferSize to one

Using a BufferSize of one for memory ProcResources will result in better
ILP since it more accurately models the dependencies between memory ops
and their consumers on an in-order processor. After this change, the
scheduler will treat the data edges from loads as blocking so that
stalls are guaranteed when waiting for data to be retreaved from memory.
Since we don't actually track waitcnt here, this should do a better job
at modeling their behavior.

Practically, this means that the scheduler will trigger the 'STALL'
heuristic more often.

This type of change needs to be evaluated experimentally. Preliminary
results are positive.

Fixes: SWDEV-282962

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D114777

show more ...


# 30b27ecf 19-Nov-2021 Jay Foad <[email protected]>

[AMDGPU] Use new opcode for indexed vgpr reads

Introduce V_MOV_B32_indirect_read for indexed vgpr reads
(and rename the old V_MOV_B32_indirect to
V_MOV_B32_indirect_write) so they can be unambiguous

[AMDGPU] Use new opcode for indexed vgpr reads

Introduce V_MOV_B32_indirect_read for indexed vgpr reads
(and rename the old V_MOV_B32_indirect to
V_MOV_B32_indirect_write) so they can be unambiguously
distinguished from regular V_MOV_B32_e32. Previously they
were distinguished by looking for extra implicit operands
but this is fragile because regular moves sometimes have
extra implicit operands too:
- either by accident, when instructions end up with
duplicate implicit operands (see e.g. D100939)
- or by design, when SIInstrInfo::copyPhysReg breaks a
multi-dword copy into individual subreg mov instructions
and adds implicit operands for the super-register.

The effect of this is that SIInstrInfo::isFoldableCopy can
be simplified and identifies more foldable copies. The test
diffs show that more immediate 0 values have been folded as
inline operands.

SIInstrInfo::isReallyTriviallyReMaterializable could
probably be simplified too but that is not part of this
patch.

Differential Revision: https://reviews.llvm.org/D114230

show more ...


# a70bbb5f 11-Nov-2021 Jay Foad <[email protected]>

[AMDGPU] Simplify 64-bit division/remainder expansion

The old expansion open-coded a 64-bit addition in a strange way, by
adding the high parts *without* carry-in from the low part, and then
adding

[AMDGPU] Simplify 64-bit division/remainder expansion

The old expansion open-coded a 64-bit addition in a strange way, by
adding the high parts *without* carry-in from the low part, and then
adding the carry back in later on. Fixing this saves a couple of
instructions and makes the code much easier to understand.

Differential Revision: https://reviews.llvm.org/D113679

show more ...


# ca0c92d6 13-Oct-2021 Stanislav Mekhanoshin <[email protected]>

[AMDGPU] Allow to use a whole register file on gfx90a for VGPRs

In a kernel which does not have calls or AGPR usage we can allocate
the whole vector register budget for VGPRs and have no AGPRs as
lo

[AMDGPU] Allow to use a whole register file on gfx90a for VGPRs

In a kernel which does not have calls or AGPR usage we can allocate
the whole vector register budget for VGPRs and have no AGPRs as
long as VGPRs stay addressable (i.e. below 256).

Differential Revision: https://reviews.llvm.org/D111764

show more ...


# c885857e 12-Oct-2021 Jay Foad <[email protected]>

[AMDGPU] Enable load clustering in the post-RA scheduler

This has a couple of benefits:
1. It can sometimes fix clusters that got broken apart when the register
allocator inserted a copy.
2. Post

[AMDGPU] Enable load clustering in the post-RA scheduler

This has a couple of benefits:
1. It can sometimes fix clusters that got broken apart when the register
allocator inserted a copy.
2. Post-RA scheduling does not have to worry about increasing register
pressure, which in some cases gives it more freedom to reorder
instructions.

Testing on a collection of 10,000 graphics shaders compiled for gfx1010
showed:
- The average length of each run of one or more load instructions
increased by about 1%.
- The number of runs of two or more load instructions increased by
about 4%.

Differential Revision: https://reviews.llvm.org/D111646

show more ...


# 66ce1015 12-Oct-2021 Jay Foad <[email protected]>

Revert "[AMDGPU] Enable load clustering in the post-RA scheduler"

This reverts commit 66e13c7f439cf162d7ed1d25883e71a5755ac7ec.

It was committed by accident.


# 66e13c7f 12-Oct-2021 Jay Foad <[email protected]>

[AMDGPU] Enable load clustering in the post-RA scheduler

This has a couple of benefits:
1. It can sometimes fix clusters that got broken apart when the register
allocator inserted a copy.
2. Post

[AMDGPU] Enable load clustering in the post-RA scheduler

This has a couple of benefits:
1. It can sometimes fix clusters that got broken apart when the register
allocator inserted a copy.
2. Post-RA scheduling does not have to worry about increasing register
pressure, which in some cases gives it more freedom to reorder
instructions.

Testing on a collection of 10,000 graphics shaders compiled for gfx1010
showed:
- The average length of each run of one or more load instructions
increased by about 1%.
- The number of runs of two or more load instructions increased by
about 4%.

show more ...


Revision tags: llvmorg-13.0.0, llvmorg-13.0.0-rc4
# 1a332946 16-Sep-2021 alex-t <[email protected]>

[AMDGPU] Filtering out the inactive lanes bits when lowering copy to SCC

Normally, given that the DA results are kept consistent over the selection DAG, uniform comparisons get selected to S_CMP_* b

[AMDGPU] Filtering out the inactive lanes bits when lowering copy to SCC

Normally, given that the DA results are kept consistent over the selection DAG, uniform comparisons get selected to S_CMP_* but divergent to V_CMP_*. Sometimes, for the sake of efficiency, SSA subgraphs may be converted to VALU to avoid repeatedly copying data back and forth. Hence we have to be able to sustain the correctness passing the i1 from VALU to SALU context and vice versa.

VALU operations only process the active lanes of the VGPR and ignore inactive ones.
Active lanes correspond to 1 bit in the EXEC mask register.
SALU represents i1 as just one bit but VALU as 64bits: 0/1 and 0/(0xffffffffffffffff & EXEC) respectively.
SALU uses one-bit conditional flag SCC but VALU - VCC that is a pair of 32-bit SGPRs

To expose SCC to the VALU context we need to convert the one-bit boolean value to the appropriate 64bit.
To return back to the SALU context we need to do the opposite.

To correctly convert 64bit VALU boolean to either 0 or 1 we need to filter out the bits corresponding to the inactive lanes.

Reviewed By: piotr

Differential Revision: https://reviews.llvm.org/D109900

show more ...


Revision tags: llvmorg-13.0.0-rc3, llvmorg-13.0.0-rc2
# 4a36e96c 21-Aug-2021 Matt Arsenault <[email protected]>

RegAllocGreedy: Account for reserved registers in num regs heuristic

This simple heuristic uses the estimated live range length combined
with the number of registers in the class to switch which heu

RegAllocGreedy: Account for reserved registers in num regs heuristic

This simple heuristic uses the estimated live range length combined
with the number of registers in the class to switch which heuristic to
use. This was taking the raw number of registers in the class, even
though not all of them may be available. AMDGPU heavily relies on
dynamically reserved numbers of registers based on user attributes to
satisfy occupancy constraints, so the raw number is highly misleading.

There are still a few problems here. In the original testcase that
made me notice this, the live range size is incorrect after the
scheduler rearranges instructions, since the instructions don't have
the original InstrDist offsets. Additionally, I think it would be more
appropriate to use the number of disjointly allocatable registers in
the class. For the AMDGPU register tuples, there are a large number of
registers in each tuple class, but only a small fraction can actually
be allocated at the same time since they all overlap with each
other. It seems we do not have a query that corresponds to the number
of independently allocatable registers. Relatedly, I'm still debugging
some allocation failures where overlapping tuples seem to not be
handled correctly.

The test changes are mostly noise. There are a handful of x86 tests
that look like regressions with an additional spill, and a handful
that now avoid a spill. The worst looking regression is likely
test/Thumb2/mve-vld4.ll which introduces a few additional
spills. test/CodeGen/AMDGPU/soft-clause-exceeds-register-budget.ll
shows a massive improvement by completely eliminating a large number
of spills inside a loop.

show more ...


# 3ce1b963 08-Sep-2021 Joe Nash <[email protected]>

[AMDGPU] Switch PostRA sched to MachineSched

Use GCNHazardRecognizer in postra sched.
Updated tests for the new schedules.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D1095

[AMDGPU] Switch PostRA sched to MachineSched

Use GCNHazardRecognizer in postra sched.
Updated tests for the new schedules.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D109536

Change-Id: Ia86ba2ae168f12fb34b4d8efdab491f84d936cde

show more ...


# 640beb38 30-Aug-2021 Michael Liao <[email protected]>

[amdgpu] Enable selection of `s_cselect_b64`.

Differential Revision: https://reviews.llvm.org/D109159


Revision tags: llvmorg-13.0.0-rc1, llvmorg-14-init
# ed0f4415 15-Jul-2021 alex-t <[email protected]>

[AMDGPU] Divergence-driven compare operations instruction selection

Description: This change enables the compare operations to be selected to SALU/VALU form
dependent of the SDNode dive

[AMDGPU] Divergence-driven compare operations instruction selection

Description: This change enables the compare operations to be selected to SALU/VALU form
dependent of the SDNode divergence flag.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D106079

show more ...


# d719f1c3 03-Aug-2021 Matt Arsenault <[email protected]>

AMDGPU: Add alloc priority to global ranges

The requested register class priorities weren't respected
globally. Not sure why this is a target option, and not just the
expected behavior (recently add

AMDGPU: Add alloc priority to global ranges

The requested register class priorities weren't respected
globally. Not sure why this is a target option, and not just the
expected behavior (recently added in
1a6dc92be7d68611077f0fb0b723b361817c950c). This avoids an allocation
failure when many wide tuple spills are introduced. I think this is a
workaround since I would not expect the allocation priority to be
required, and only a performance hint. The allocator should be smarter
about when only a subregister needs to be spilled and restored.

This does regress a couple of degenerate store stress lit tests which
shouldn't be too important.

show more ...


# c2340517 04-Aug-2021 Craig Topper <[email protected]>

[DAGCombiner][AMDGPU] Canonicalize constants to the RHS of MULHU/MULHS.

This allows special constants like to 0 to be recognized. It's also
expected by isel patterns if a target had a mulh with imme

[DAGCombiner][AMDGPU] Canonicalize constants to the RHS of MULHU/MULHS.

This allows special constants like to 0 to be recognized. It's also
expected by isel patterns if a target had a mulh with immediate instructions.
The commuting done by tablegen won't commute patterns with immediates since it
expects DAGCombine to have done it.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D107486

show more ...


# 381ded34 28-Jun-2021 Stanislav Mekhanoshin <[email protected]>

[AMDGPU] Add S_MOV_B64_IMM_PSEUDO for wide constants

This is to allow 64 bit constant rematerialization. If a constant
is split into two separate moves initializing sub0 and sub1 like
now RA cannot

[AMDGPU] Add S_MOV_B64_IMM_PSEUDO for wide constants

This is to allow 64 bit constant rematerialization. If a constant
is split into two separate moves initializing sub0 and sub1 like
now RA cannot rematerizalize a 64 bit register.

This gives 10-20% uplift in a set of huge apps heavily using double
precession math.

Fixes: SWDEV-292645

Differential Revision: https://reviews.llvm.org/D104874

show more ...


Revision tags: llvmorg-12.0.1, llvmorg-12.0.1-rc4, llvmorg-12.0.1-rc3
# 1f9dcd2b 18-Jun-2021 Jay Foad <[email protected]>

[AMDGPU] Update generated checks. NFC.


Revision tags: llvmorg-12.0.1-rc2, llvmorg-12.0.1-rc1
# 3f7b7e73 08-May-2021 Brendon Cahoon <[email protected]>

[AMDGPU] Update SCC defs to VCC when uses are changed to VCC

The FixSGPRCopies pass converts instructions to VALU when
removing illegal VGPR to SGPR copies. Instructions that use SCC
are changed to

[AMDGPU] Update SCC defs to VCC when uses are changed to VCC

The FixSGPRCopies pass converts instructions to VALU when
removing illegal VGPR to SGPR copies. Instructions that use SCC
are changed to use VCC instead. When that happens, the pass must
also change instructions that define SCC to define VCC.

The pass was not changing the SCC definition when an ADDC is
converted due to a input that is a VGPR to SGPR copy. But, the
initial ADD insruction, which define SCC, is not converted.
This causes a compilation failure due to a use of an undefined
physical register.

This patch adds code that inserts the SCC definition in the
MoveToVALU worklist when a SCC use is converted to a VCC use.

Differential Revision: https://reviews.llvm.org/D102111

show more ...


12