History log of /llvm-project-15.0.7/llvm/test/CodeGen/AMDGPU/sdiv64.ll (Results 1 – 25 of 46)
Revision Date Author Comments
Revision tags: llvmorg-20.1.0, llvmorg-20.1.0-rc3, llvmorg-20.1.0-rc2, llvmorg-20.1.0-rc1, llvmorg-21-init, llvmorg-19.1.7, llvmorg-19.1.6, llvmorg-19.1.5, llvmorg-19.1.4, llvmorg-19.1.3, llvmorg-19.1.2, llvmorg-19.1.1, llvmorg-19.1.0, llvmorg-19.1.0-rc4, llvmorg-19.1.0-rc3, llvmorg-19.1.0-rc2, llvmorg-19.1.0-rc1, llvmorg-20-init, llvmorg-18.1.8, llvmorg-18.1.7, llvmorg-18.1.6, llvmorg-18.1.5, llvmorg-18.1.4, llvmorg-18.1.3, llvmorg-18.1.2, llvmorg-18.1.1, llvmorg-18.1.0, llvmorg-18.1.0-rc4, llvmorg-18.1.0-rc3, llvmorg-18.1.0-rc2, llvmorg-18.1.0-rc1, llvmorg-19-init, llvmorg-17.0.6, llvmorg-17.0.5, llvmorg-17.0.4, llvmorg-17.0.3, llvmorg-17.0.2, llvmorg-17.0.1, llvmorg-17.0.0, llvmorg-17.0.0-rc4, llvmorg-17.0.0-rc3, llvmorg-17.0.0-rc2, llvmorg-17.0.0-rc1, llvmorg-18-init, llvmorg-16.0.6, llvmorg-16.0.5, llvmorg-16.0.4, llvmorg-16.0.3, llvmorg-16.0.2, llvmorg-16.0.1, llvmorg-16.0.0, llvmorg-16.0.0-rc4, llvmorg-16.0.0-rc3, llvmorg-16.0.0-rc2, llvmorg-16.0.0-rc1, llvmorg-17-init, llvmorg-15.0.7, llvmorg-15.0.6, llvmorg-15.0.5, llvmorg-15.0.4, llvmorg-15.0.3, llvmorg-15.0.2, llvmorg-15.0.1, llvmorg-15.0.0, llvmorg-15.0.0-rc3, llvmorg-15.0.0-rc2, llvmorg-15.0.0-rc1, llvmorg-16-init, llvmorg-14.0.6
# 2e29b013 21-Jun-2022 Alexander Timofeev <[email protected]>

[AMDGPU] Lowering VGPR to SGPR copies to v_readfirstlane_b32 if profitable.

Since the divergence-driven instruction selection has been enabled for AMDGPU,
all the uniform instructions are expected to be selected to SALU form, except those that have no SALU form.
VGPR to SGPR copies appear in MIR to connect value producers and consumers. This change implements an algorithm
that strikes a reasonable tradeoff between the profit achieved from keeping the uniform instructions in SALU form
and the overhead introduced by the data transfer between the VGPRs and SGPRs.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D128252


Revision tags: llvmorg-14.0.5, llvmorg-14.0.4
# e2926501 16-May-2022 Jay Foad <[email protected]>

[AMDGPU] Aggressively fold immediates in SIShrinkInstructions

Fold immediates regardless of how many uses they have. This is expected
to increase overall code size, but decrease register usage.

Differential Revision: https://reviews.llvm.org/D114644


# 3eb2281b 16-May-2022 Jay Foad <[email protected]>

[AMDGPU] Aggressively fold immediates in SIFoldOperands

Previously SIFoldOperands::foldInstOperand would only fold a
non-inlinable immediate into a single user, so as not to increase code
size by adding the same 32-bit literal operand to many instructions.

This patch removes that restriction, so that a non-inlinable immediate
will be folded into any number of users. The rationale is:
- It reduces the number of registers used for holding constant values,
which might increase occupancy. (On the other hand, many of these
registers are SGPRs which no longer affect occupancy on GFX10+.)
- It reduces ALU stalls between the instruction that loads a constant
into a register, and the instruction that uses it.
- The above benefits are expected to outweigh any increase in code size.

Differential Revision: https://reviews.llvm.org/D114643


Revision tags: llvmorg-14.0.3, llvmorg-14.0.2, llvmorg-14.0.1
# fa630e75 01-Apr-2022 Craig Topper <[email protected]>

[RISCV][AMDGPU][TargetLowering] Special case overflow expansion for (uaddo X, 1).

If we expand (uaddo X, 1) we previously expanded the overflow calculation
as (X + 1) <u X. This potentially increases the live range of X and
can prevent X+1 from reusing the register that previously held X.

Since we're adding 1, overflow only occurs if X was UINT_MAX in which
case (X+1) would be 0. So this patch adds a special case to expand
the overflow calculation to (X+1) == 0.

This seems to help with uaddo intrinsics that get introduced by
CodeGenPrepare after LSR. Alternatively, we could block the uaddo
transform in CodeGenPrepare for this case.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D122933
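
A minimal LLVM IR sketch of the two expansions (hand-written for illustration, not taken from the patch):

    ; Hypothetical illustration of expanding (uaddo %x, 1).
    define { i64, i1 } @uaddo_one(i64 %x) {
      %sum = add i64 %x, 1
      ; Old expansion: overflow iff the result wrapped below the input.
      ; This keeps %x live past the add:
      ;   %ovf = icmp ult i64 %sum, %x
      ; New special case: adding 1 overflows only when %x == UINT64_MAX,
      ; i.e. exactly when the sum is 0, so %x can die at the add:
      %ovf = icmp eq i64 %sum, 0
      %r0 = insertvalue { i64, i1 } undef, i64 %sum, 0
      %r1 = insertvalue { i64, i1 } %r0, i1 %ovf, 1
      ret { i64, i1 } %r1
    }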


Revision tags: llvmorg-14.0.0, llvmorg-14.0.0-rc4, llvmorg-14.0.0-rc3, llvmorg-14.0.0-rc2
# 565af157 25-Feb-2022 Carl Ritson <[email protected]>

[AMDGPU] Extend pre-emit peephole to redundantly masked VCC

Extend pre-emit peephole for S_CBRANCH_VCC[N]Z to eliminate
redundant S_AND operations against EXEC for V_CMP results in VCC.
These occur after register allocation when VCC has been
selected as the comparison destination.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D120202


Revision tags: llvmorg-14.0.0-rc1, llvmorg-15-init, llvmorg-13.0.1, llvmorg-13.0.1-rc3
# 0776f6e0 13-Jan-2022 Benjamin Kramer <[email protected]>

[LSV] Vectorize loads of vectors by turning it into a larger vector

Use shufflevector to do the subvector extracts. This allows a lot more
load merging on AMDGPU and also on NVPTX when <2 x half> is involved.

Differential Revision: https://reviews.llvm.org/D117219
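
A hedged sketch of the idea (function names invented for illustration): two adjacent <2 x half> loads merged into one <4 x half> load, with the original subvectors recovered via shufflevector:

    ; Illustrative only: one wide load replacing two narrow vector loads.
    define void @merged_load(ptr %p, ptr %q0, ptr %q1) {
      %v  = load <4 x half>, ptr %p, align 4
      ; shufflevector extracts each original subvector from the wide load.
      %lo = shufflevector <4 x half> %v, <4 x half> poison, <2 x i32> <i32 0, i32 1>
      %hi = shufflevector <4 x half> %v, <4 x half> poison, <2 x i32> <i32 2, i32 3>
      store <2 x half> %lo, ptr %q0, align 4
      store <2 x half> %hi, ptr %q1, align 4
      ret void
    }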


Revision tags: llvmorg-13.0.1-rc2
# e9179a6a 08-Dec-2021 Sanjay Patel <[email protected]>

[Support] improve known bits analysis for multiply by power-of-2 (1 set bit)

This can be viewed as recognizing that multiply-by-power-of-2 doesn't
have a carry into the top bit of an M-bit * N-bit number.

Enhancing canonicalization of mul -> select might also handle some of
these if we were ok with increasing instruction count with casts in
some cases.

This doesn't help https://llvm.org/PR49055, but it's a simpler
pattern that we miss.
Note: "-sccp" already gets these examples using a constant
range analysis.

Differential Revision: https://reviews.llvm.org/D114962
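
A hand-written LLVM IR example of the kind of fact this enables (illustrative, not from the patch): a power-of-two multiply is a shift and cannot carry into higher bits, so a bounded operand gives a bounded product:

    ; If %x fits in 8 bits, then %x * 16 fits in 12 bits.
    define i64 @mul_pow2_high_bits(i64 %x) {
      %x8 = and i64 %x, 255     ; %x8 < 2^8
      %m  = mul i64 %x8, 16     ; same as shl by 4; product < 2^12
      %hi = lshr i64 %m, 12     ; known-bits analysis can prove this is 0
      ret i64 %hi
    }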


Revision tags: llvmorg-13.0.1-rc1
# da067ed5 10-Nov-2021 Austin Kerbow <[email protected]>

[AMDGPU] Set most sched model resources' BufferSize to one

Using a BufferSize of one for memory ProcResources will result in better
ILP since it more accurately models the dependencies between memory ops
and their consumers on an in-order processor. After this change, the
scheduler will treat the data edges from loads as blocking so that
stalls are guaranteed when waiting for data to be retrieved from memory.
Since we don't actually track waitcnt here, this should do a better job
at modeling their behavior.

Practically, this means that the scheduler will trigger the 'STALL'
heuristic more often.

This type of change needs to be evaluated experimentally. Preliminary
results are positive.

Fixes: SWDEV-282962

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D114777


# 18f93512 19-Nov-2021 RamNalamothu <[email protected]>

[AMDGPU] Do not generate ELF symbols for the local branch target labels

The compiler was generating symbols in the final code object for local
branch target labels. This bloats the code object, slows down the loader,
and is only used to simplify disassembly.

Use '--symbolize-operands' with llvm-objdump to improve readability of the
branch target operands in disassembly.

Fixes: SWDEV-312223

Reviewed By: scott.linder

Differential Revision: https://reviews.llvm.org/D114273


# 30b27ecf 19-Nov-2021 Jay Foad <[email protected]>

[AMDGPU] Use new opcode for indexed vgpr reads

Introduce V_MOV_B32_indirect_read for indexed vgpr reads
(and rename the old V_MOV_B32_indirect to
V_MOV_B32_indirect_write) so they can be unambiguously
distinguished from regular V_MOV_B32_e32. Previously they
were distinguished by looking for extra implicit operands
but this is fragile because regular moves sometimes have
extra implicit operands too:
- either by accident, when instructions end up with
duplicate implicit operands (see e.g. D100939)
- or by design, when SIInstrInfo::copyPhysReg breaks a
multi-dword copy into individual subreg mov instructions
and adds implicit operands for the super-register.

The effect of this is that SIInstrInfo::isFoldableCopy can
be simplified and identifies more foldable copies. The test
diffs show that more immediate 0 values have been folded as
inline operands.

SIInstrInfo::isReallyTriviallyReMaterializable could
probably be simplified too but that is not part of this
patch.

Differential Revision: https://reviews.llvm.org/D114230


# a70bbb5f 11-Nov-2021 Jay Foad <[email protected]>

[AMDGPU] Simplify 64-bit division/remainder expansion

The old expansion open-coded a 64-bit addition in a strange way, by
adding the high parts *without* carry-in from the low part, and then
adding the carry back in later on. Fixing this saves a couple of
instructions and makes the code much easier to understand.

Differential Revision: https://reviews.llvm.org/D113679
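
A rough LLVM IR rendering of the fixed pattern (illustrative; the real change operates on the AMDGPU expansion, not on IR like this): a 64-bit add built from 32-bit halves with the carry folded in directly:

    define i64 @add64_halves(i32 %alo, i32 %ahi, i32 %blo, i32 %bhi) {
      %lo  = add i32 %alo, %blo
      %c   = icmp ult i32 %lo, %alo   ; carry out of the low half
      %cz  = zext i1 %c to i32
      %hi  = add i32 %ahi, %bhi
      %hi1 = add i32 %hi, %cz         ; carry-in applied once, directly
      %hw  = zext i32 %hi1 to i64
      %lw  = zext i32 %lo to i64
      %hs  = shl i64 %hw, 32
      %r   = or i64 %hs, %lw
      ret i64 %r
    }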


Revision tags: llvmorg-13.0.0, llvmorg-13.0.0-rc4, llvmorg-13.0.0-rc3, llvmorg-13.0.0-rc2, llvmorg-13.0.0-rc1, llvmorg-14-init
# c80d8a8c 23-Jul-2021 Stanislav Mekhanoshin <[email protected]>

[AMDGPU] MachineLICM cannot hoist VALU

MachineLoop::isLoopInvariant() returns false for all VALU
because of the exec use. Check TII::isIgnorableUse() to
allow hoisting.

That unfortunately results in higher register consumption
since MachineLICM does not adequately estimate pressure.
Therefore I think it should only be enabled after D107677 even
though it does not depend on it.

Differential Revision: https://reviews.llvm.org/D107859


# c885857e 12-Oct-2021 Jay Foad <[email protected]>

[AMDGPU] Enable load clustering in the post-RA scheduler

This has a couple of benefits:
1. It can sometimes fix clusters that got broken apart when the register
allocator inserted a copy.
2. Post-RA scheduling does not have to worry about increasing register
pressure, which in some cases gives it more freedom to reorder
instructions.

Testing on a collection of 10,000 graphics shaders compiled for gfx1010
showed:
- The average length of each run of one or more load instructions
increased by about 1%.
- The number of runs of two or more load instructions increased by
about 4%.

Differential Revision: https://reviews.llvm.org/D111646


# 66ce1015 12-Oct-2021 Jay Foad <[email protected]>

Revert "[AMDGPU] Enable load clustering in the post-RA scheduler"

This reverts commit 66e13c7f439cf162d7ed1d25883e71a5755ac7ec.

It was committed by accident.


# 66e13c7f 12-Oct-2021 Jay Foad <[email protected]>

[AMDGPU] Enable load clustering in the post-RA scheduler

This has a couple of benefits:
1. It can sometimes fix clusters that got broken apart when the register
allocator inserted a copy.
2. Post-RA scheduling does not have to worry about increasing register
pressure, which in some cases gives it more freedom to reorder
instructions.

Testing on a collection of 10,000 graphics shaders compiled for gfx1010
showed:
- The average length of each run of one or more load instructions
increased by about 1%.
- The number of runs of two or more load instructions increased by
about 4%.


# 4a36e96c 21-Aug-2021 Matt Arsenault <[email protected]>

RegAllocGreedy: Account for reserved registers in num regs heuristic

This simple heuristic uses the estimated live range length combined
with the number of registers in the class to switch which heuristic to
use. This was taking the raw number of registers in the class, even
though not all of them may be available. AMDGPU heavily relies on
dynamically reserved numbers of registers based on user attributes to
satisfy occupancy constraints, so the raw number is highly misleading.

There are still a few problems here. In the original testcase that
made me notice this, the live range size is incorrect after the
scheduler rearranges instructions, since the instructions don't have
the original InstrDist offsets. Additionally, I think it would be more
appropriate to use the number of disjointly allocatable registers in
the class. For the AMDGPU register tuples, there are a large number of
registers in each tuple class, but only a small fraction can actually
be allocated at the same time since they all overlap with each
other. It seems we do not have a query that corresponds to the number
of independently allocatable registers. Relatedly, I'm still debugging
some allocation failures where overlapping tuples seem to not be
handled correctly.

The test changes are mostly noise. There are a handful of x86 tests
that look like regressions with an additional spill, and a handful
that now avoid a spill. The worst looking regression is likely
test/Thumb2/mve-vld4.ll which introduces a few additional
spills. test/CodeGen/AMDGPU/soft-clause-exceeds-register-budget.ll
shows a massive improvement by completely eliminating a large number
of spills inside a loop.


# 3ce1b963 08-Sep-2021 Joe Nash <[email protected]>

[AMDGPU] Switch PostRA sched to MachineSched

Use GCNHazardRecognizer in postra sched.
Updated tests for the new schedules.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D109536

Change-Id: Ia86ba2ae168f12fb34b4d8efdab491f84d936cde


# d719f1c3 03-Aug-2021 Matt Arsenault <[email protected]>

AMDGPU: Add alloc priority to global ranges

The requested register class priorities weren't respected
globally. Not sure why this is a target option, and not just the
expected behavior (recently added in
1a6dc92be7d68611077f0fb0b723b361817c950c). This avoids an allocation
failure when many wide tuple spills are introduced. I think this is a
workaround since I would not expect the allocation priority to be
required, and only a performance hint. The allocator should be smarter
about when only a subregister needs to be spilled and restored.

This does regress a couple of degenerate store stress lit tests which
shouldn't be too important.


# f7076cfd 05-Aug-2021 Craig Topper <[email protected]>

[DAGCombiner][RISCV][AMDGPU] Call SimplifyDemandedBits at the end of visitMULHU to enable known bits constant folding.

We don't have real demanded bits support for MULHU, but we can
still use the known bits based constant folding support at the end
of SimplifyDemandedBits to simplify a MULHU. This helps with cases
where we know the LHS and RHS have enough leading zeros so that
the high multiply result is always 0.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D106471
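
An illustrative IR analogue (hypothetical; MULHU is a DAG node, sketched here via a widened multiply): when both operands have enough leading zeros, the high half of the product is provably zero:

    define i64 @mulhu_known_zero(i64 %a, i64 %b) {
      %a32 = and i64 %a, 4294967295     ; %a32 < 2^32
      %b32 = and i64 %b, 4294967295     ; %b32 < 2^32
      %aw  = zext i64 %a32 to i128
      %bw  = zext i64 %b32 to i128
      %p   = mul i128 %aw, %bw          ; product < 2^64
      %hi  = lshr i128 %p, 64           ; the "mulhu" half: known zero
      %r   = trunc i128 %hi to i64
      ret i64 %r
    }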


# e6c364a6 05-Aug-2021 Jay Foad <[email protected]>

[AMDGPU][SDag] Better lowering for 64-bit ctlz/cttz

Differential Revision: https://reviews.llvm.org/D107546


# c2340517 04-Aug-2021 Craig Topper <[email protected]>

[DAGCombiner][AMDGPU] Canonicalize constants to the RHS of MULHU/MULHS.

This allows special constants like 0 to be recognized. It's also
expected by isel patterns if a target has mulh with immediate instructions.
The commuting done by tablegen won't commute patterns with immediates since it
expects DAGCombine to have done it.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D107486


# 4a3b0556 09-Jul-2021 Stanislav Mekhanoshin <[email protected]>

[AMDGPU] Fix flags of V_MOV_B64_PSEUDO

In particular it was not rematerializable.

Differential Revision: https://reviews.llvm.org/D105724


# 381ded34 28-Jun-2021 Stanislav Mekhanoshin <[email protected]>

[AMDGPU] Add S_MOV_B64_IMM_PSEUDO for wide constants

This is to allow 64-bit constant rematerialization. If a constant
is split into two separate moves initializing sub0 and sub1, as is
done now, RA cannot rematerialize a 64-bit register.

This gives a 10-20% uplift in a set of huge apps that make heavy
use of double precision math.

Fixes: SWDEV-292645

Differential Revision: https://reviews.llvm.org/D104874


Revision tags: llvmorg-12.0.1, llvmorg-12.0.1-rc4, llvmorg-12.0.1-rc3
# 98f48723 24-Jun-2021 Carl Ritson <[email protected]>

[AMDGPU] Add 224-bit vector types and link 192-bit types to MVTs

Add SReg_224, VReg_224, AReg_224, etc.
Link 224-bit types with v7i32/v7f32.
Link existing 192-bit types to newly added v3i64/v3f64/v6i32/v6f32.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D104622


Revision tags: llvmorg-12.0.1-rc2, llvmorg-12.0.1-rc1
# 3f7b7e73 08-May-2021 Brendon Cahoon <[email protected]>

[AMDGPU] Update SCC defs to VCC when uses are changed to VCC

The FixSGPRCopies pass converts instructions to VALU when
removing illegal VGPR to SGPR copies. Instructions that use SCC
are changed to use VCC instead. When that happens, the pass must
also change instructions that define SCC to define VCC.

The pass was not changing the SCC definition when an ADDC is
converted due to an input that is a VGPR to SGPR copy, but the
initial ADD instruction, which defines SCC, is not converted.
This causes a compilation failure due to a use of an undefined
physical register.

This patch adds code that inserts the SCC definition in the
MoveToVALU worklist when an SCC use is converted to a VCC use.

Differential Revision: https://reviews.llvm.org/D102111

