History log of /llvm-project-15.0.7/llvm/test/CodeGen/AMDGPU/ds-alignment.ll (Results 1 – 13 of 13)
Revision (<<< Hide revision tags) (Show revision tags >>>) Date Author Comments
Revision tags: llvmorg-20.1.0, llvmorg-20.1.0-rc3, llvmorg-20.1.0-rc2, llvmorg-20.1.0-rc1, llvmorg-21-init, llvmorg-19.1.7, llvmorg-19.1.6, llvmorg-19.1.5, llvmorg-19.1.4, llvmorg-19.1.3, llvmorg-19.1.2, llvmorg-19.1.1, llvmorg-19.1.0, llvmorg-19.1.0-rc4, llvmorg-19.1.0-rc3, llvmorg-19.1.0-rc2, llvmorg-19.1.0-rc1, llvmorg-20-init, llvmorg-18.1.8, llvmorg-18.1.7, llvmorg-18.1.6, llvmorg-18.1.5, llvmorg-18.1.4, llvmorg-18.1.3, llvmorg-18.1.2, llvmorg-18.1.1, llvmorg-18.1.0, llvmorg-18.1.0-rc4, llvmorg-18.1.0-rc3, llvmorg-18.1.0-rc2, llvmorg-18.1.0-rc1, llvmorg-19-init, llvmorg-17.0.6, llvmorg-17.0.5, llvmorg-17.0.4, llvmorg-17.0.3, llvmorg-17.0.2, llvmorg-17.0.1, llvmorg-17.0.0, llvmorg-17.0.0-rc4, llvmorg-17.0.0-rc3, llvmorg-17.0.0-rc2, llvmorg-17.0.0-rc1, llvmorg-18-init, llvmorg-16.0.6, llvmorg-16.0.5, llvmorg-16.0.4, llvmorg-16.0.3, llvmorg-16.0.2, llvmorg-16.0.1, llvmorg-16.0.0, llvmorg-16.0.0-rc4, llvmorg-16.0.0-rc3, llvmorg-16.0.0-rc2, llvmorg-16.0.0-rc1, llvmorg-17-init, llvmorg-15.0.7, llvmorg-15.0.6, llvmorg-15.0.5, llvmorg-15.0.4, llvmorg-15.0.3, llvmorg-15.0.2, llvmorg-15.0.1, llvmorg-15.0.0, llvmorg-15.0.0-rc3, llvmorg-15.0.0-rc2, llvmorg-15.0.0-rc1, llvmorg-16-init, llvmorg-14.0.6, llvmorg-14.0.5, llvmorg-14.0.4, llvmorg-14.0.3, llvmorg-14.0.2
# ac94073d 12-Apr-2022 Stanislav Mekhanoshin <[email protected]>

[AMDGPU] Refine 64 bit misaligned LDS ops selection

Here is the performance data:
```
Using platform: AMD Accelerated Parallel Processing
Using device: gfx900:xnack-

ds_write_b64

[AMDGPU] Refine 64 bit misaligned LDS ops selection

Here is the performance data:
```
Using platform: AMD Accelerated Parallel Processing
Using device: gfx900:xnack-

ds_write_b64 aligned by 8: 3.2 sec
ds_write2_b32 aligned by 8: 3.2 sec
ds_write_b16 * 4 aligned by 8: 7.0 sec
ds_write_b8 * 8 aligned by 8: 13.2 sec
ds_write_b64 aligned by 1: 7.3 sec
ds_write2_b32 aligned by 1: 7.5 sec
ds_write_b16 * 4 aligned by 1: 14.0 sec
ds_write_b8 * 8 aligned by 1: 13.2 sec
ds_write_b64 aligned by 2: 7.3 sec
ds_write2_b32 aligned by 2: 7.5 sec
ds_write_b16 * 4 aligned by 2: 7.1 sec
ds_write_b8 * 8 aligned by 2: 13.3 sec
ds_write_b64 aligned by 4: 4.6 sec
ds_write2_b32 aligned by 4: 3.2 sec
ds_write_b16 * 4 aligned by 4: 7.1 sec
ds_write_b8 * 8 aligned by 4: 13.3 sec
ds_read_b64 aligned by 8: 2.3 sec
ds_read2_b32 aligned by 8: 2.2 sec
ds_read_u16 * 4 aligned by 8: 4.8 sec
ds_read_u8 * 8 aligned by 8: 8.6 sec
ds_read_b64 aligned by 1: 4.4 sec
ds_read2_b32 aligned by 1: 7.3 sec
ds_read_u16 * 4 aligned by 1: 14.0 sec
ds_read_u8 * 8 aligned by 1: 8.7 sec
ds_read_b64 aligned by 2: 4.4 sec
ds_read2_b32 aligned by 2: 7.3 sec
ds_read_u16 * 4 aligned by 2: 4.8 sec
ds_read_u8 * 8 aligned by 2: 8.7 sec
ds_read_b64 aligned by 4: 4.4 sec
ds_read2_b32 aligned by 4: 2.3 sec
ds_read_u16 * 4 aligned by 4: 4.8 sec
ds_read_u8 * 8 aligned by 4: 8.7 sec

Using platform: AMD Accelerated Parallel Processing
Using device: gfx1030

ds_write_b64 aligned by 8: 4.4 sec
ds_write2_b32 aligned by 8: 4.3 sec
ds_write_b16 * 4 aligned by 8: 7.9 sec
ds_write_b8 * 8 aligned by 8: 13.0 sec
ds_write_b64 aligned by 1: 23.2 sec
ds_write2_b32 aligned by 1: 23.1 sec
ds_write_b16 * 4 aligned by 1: 44.0 sec
ds_write_b8 * 8 aligned by 1: 13.0 sec
ds_write_b64 aligned by 2: 23.2 sec
ds_write2_b32 aligned by 2: 23.1 sec
ds_write_b16 * 4 aligned by 2: 7.9 sec
ds_write_b8 * 8 aligned by 2: 13.1 sec
ds_write_b64 aligned by 4: 13.5 sec
ds_write2_b32 aligned by 4: 4.3 sec
ds_write_b16 * 4 aligned by 4: 7.9 sec
ds_write_b8 * 8 aligned by 4: 13.1 sec
ds_read_b64 aligned by 8: 3.5 sec
ds_read2_b32 aligned by 8: 3.4 sec
ds_read_u16 * 4 aligned by 8: 5.3 sec
ds_read_u8 * 8 aligned by 8: 8.5 sec
ds_read_b64 aligned by 1: 13.1 sec
ds_read2_b32 aligned by 1: 22.7 sec
ds_read_u16 * 4 aligned by 1: 43.9 sec
ds_read_u8 * 8 aligned by 1: 7.9 sec
ds_read_b64 aligned by 2: 13.1 sec
ds_read2_b32 aligned by 2: 22.7 sec
ds_read_u16 * 4 aligned by 2: 5.6 sec
ds_read_u8 * 8 aligned by 2: 7.9 sec
ds_read_b64 aligned by 4: 13.1 sec
ds_read2_b32 aligned by 4: 3.4 sec
ds_read_u16 * 4 aligned by 4: 5.6 sec
ds_read_u8 * 8 aligned by 4: 7.9 sec
```

GFX10 exposes a different pattern for sub-DWORD load/store performance
than GFX9. On GFX9 it is faster to issue a single unaligned load or
store than a fully split b8 access, where on GFX10 even a full split
is better. However, this is a theoretical only gain because splitting
an access to a sub-dword level will require more registers and packing/
unpacking logic, so ignoring this option it is better to use a single
64 bit instruction on a misaligned data with the exception of 4 byte
aligned data where ds_read2_b32/ds_write2_b32 is better.

Differential Revision: https://reviews.llvm.org/D123956

show more ...


Revision tags: llvmorg-14.0.1
# f6462a26 11-Apr-2022 Stanislav Mekhanoshin <[email protected]>

[AMDGPU] Split unaligned 4 DWORD DS operations

Similarly to 3 DWORD operations it is better for performance
to split unlaligned operations as long a these are at least
DWORD alignmened. Performance

[AMDGPU] Split unaligned 4 DWORD DS operations

Similarly to 3 DWORD operations it is better for performance
to split unlaligned operations as long a these are at least
DWORD alignmened. Performance data:

```
Using platform: AMD Accelerated Parallel Processing
Using device: gfx900:xnack-

ds_write_b128 aligned by 16: 4.9 sec
ds_write2_b64 aligned by 16: 5.1 sec
ds_write2_b32 * 2 aligned by 16: 5.5 sec
ds_write_b128 aligned by 1: 8.1 sec
ds_write2_b64 aligned by 1: 8.7 sec
ds_write2_b32 * 2 aligned by 1: 14.0 sec
ds_write_b128 aligned by 2: 8.1 sec
ds_write2_b64 aligned by 2: 8.7 sec
ds_write2_b32 * 2 aligned by 2: 14.0 sec
ds_write_b128 aligned by 4: 5.6 sec
ds_write2_b64 aligned by 4: 8.7 sec
ds_write2_b32 * 2 aligned by 4: 5.6 sec
ds_write_b128 aligned by 8: 5.6 sec
ds_write2_b64 aligned by 8: 5.1 sec
ds_write2_b32 * 2 aligned by 8: 5.6 sec
ds_read_b128 aligned by 16: 3.8 sec
ds_read2_b64 aligned by 16: 3.8 sec
ds_read2_b32 * 2 aligned by 16: 4.0 sec
ds_read_b128 aligned by 1: 4.6 sec
ds_read2_b64 aligned by 1: 8.1 sec
ds_read2_b32 * 2 aligned by 1: 14.0 sec
ds_read_b128 aligned by 2: 4.6 sec
ds_read2_b64 aligned by 2: 8.1 sec
ds_read2_b32 * 2 aligned by 2: 14.0 sec
ds_read_b128 aligned by 4: 4.6 sec
ds_read2_b64 aligned by 4: 8.1 sec
ds_read2_b32 * 2 aligned by 4: 4.0 sec
ds_read_b128 aligned by 8: 4.6 sec
ds_read2_b64 aligned by 8: 3.8 sec
ds_read2_b32 * 2 aligned by 8: 4.0 sec

Using platform: AMD Accelerated Parallel Processing
Using device: gfx1030

ds_write_b128 aligned by 16: 6.2 sec
ds_write2_b64 aligned by 16: 7.1 sec
ds_write2_b32 * 2 aligned by 16: 7.6 sec
ds_write_b128 aligned by 1: 24.1 sec
ds_write2_b64 aligned by 1: 25.2 sec
ds_write2_b32 * 2 aligned by 1: 43.7 sec
ds_write_b128 aligned by 2: 24.1 sec
ds_write2_b64 aligned by 2: 25.1 sec
ds_write2_b32 * 2 aligned by 2: 43.7 sec
ds_write_b128 aligned by 4: 14.4 sec
ds_write2_b64 aligned by 4: 25.1 sec
ds_write2_b32 * 2 aligned by 4: 7.6 sec
ds_write_b128 aligned by 8: 14.4 sec
ds_write2_b64 aligned by 8: 7.1 sec
ds_write2_b32 * 2 aligned by 8: 7.6 sec
ds_read_b128 aligned by 16: 6.2 sec
ds_read2_b64 aligned by 16: 6.3 sec
ds_read2_b32 * 2 aligned by 16: 7.5 sec
ds_read_b128 aligned by 1: 12.5 sec
ds_read2_b64 aligned by 1: 24.0 sec
ds_read2_b32 * 2 aligned by 1: 43.6 sec
ds_read_b128 aligned by 2: 12.5 sec
ds_read2_b64 aligned by 2: 24.0 sec
ds_read2_b32 * 2 aligned by 2: 43.6 sec
ds_read_b128 aligned by 4: 12.5 sec
ds_read2_b64 aligned by 4: 24.0 sec
ds_read2_b32 * 2 aligned by 4: 7.5 sec
ds_read_b128 aligned by 8: 12.5 sec
ds_read2_b64 aligned by 8: 6.3 sec
ds_read2_b32 * 2 aligned by 8: 7.5 sec
```

Differential Revision: https://reviews.llvm.org/D123634

show more ...


# 65b8a432 12-Apr-2022 Stanislav Mekhanoshin <[email protected]>

[AMDGPU] Update ds-alignment.ll test checks. NFC.


# 3870b360 11-Apr-2022 Stanislav Mekhanoshin <[email protected]>

[AMDGPU] Split unaligned 3 DWORD DS operations

I have written a minitest to check the performance. Overall
the benefit of aligned b96 operations on data which is not
known but happens to be aligned

[AMDGPU] Split unaligned 3 DWORD DS operations

I have written a minitest to check the performance. Overall
the benefit of aligned b96 operations on data which is not
known but happens to be aligned is small, while performance
hit of using b96 operations on a really unaligned memory is
high.

The only exception is when data is not aligned even by 4, it
is better to use b96 in this case.

Here is the test output on Vega and Navi:

```
Using platform: AMD Accelerated Parallel Processing
Using device: gfx900:xnack-

ds_write_b96 aligned: 3.4 sec
ds_write_b32 + ds_write_b64 aligned: 4.5 sec
ds_write_b32 * 3 aligned: 4.8 sec
ds_write_b96 misaligned by 1: 4.8 sec
ds_write_b32 + ds_write_b64 misaligned by 1: 7.2 sec
ds_write_b32 * 3 misaligned by 1: 10.0 sec
ds_write_b96 misaligned by 2: 4.8 sec
ds_write_b32 + ds_write_b64 misaligned by 2: 7.2 sec
ds_write_b32 * 3 misaligned by 2: 10.1 sec
ds_write_b96 misaligned by 4: 4.8 sec
ds_write_b32 + ds_write_b64 misaligned by 4: 4.2 sec
ds_write_b32 * 3 misaligned by 4: 4.9 sec
ds_write_b96 misaligned by 8: 4.8 sec
ds_write_b32 + ds_write_b64 misaligned by 8: 4.6 sec
ds_write_b32 * 3 misaligned by 8: 4.9 sec
ds_read_b96 aligned: 3.3 sec
ds_read_b32 + ds_read_b64 aligned: 4.9 sec
ds_read_b32 * 3 aligned: 2.6 sec
ds_read_b96 misaligned by 1: 4.1 sec
ds_read_b32 + ds_read_b64 misaligned by 1: 7.2 sec
ds_read_b32 * 3 misaligned by 1: 10.1 sec
ds_read_b96 misaligned by 2: 4.1 sec
ds_read_b32 + ds_read_b64 misaligned by 2: 7.2 sec
ds_read_b32 * 3 misaligned by 2: 10.1 sec
ds_read_b96 misaligned by 4: 4.1 sec
ds_read_b32 + ds_read_b64 misaligned by 4: 2.6 sec
ds_read_b32 * 3 misaligned by 4: 2.6 sec
ds_read_b96 misaligned by 8: 4.1 sec
ds_read_b32 + ds_read_b64 misaligned by 8: 4.9 sec
ds_read_b32 * 3 misaligned by 8: 2.6 sec

Using platform: AMD Accelerated Parallel Processing
Using device: gfx1030

ds_write_b96 aligned: 4.1 sec
ds_write_b32 + ds_write_b64 aligned: 13.0 sec
ds_write_b32 * 3 aligned: 4.5 sec
ds_write_b96 misaligned by 1: 12.5 sec
ds_write_b32 + ds_write_b64 misaligned by 1: 22.0 sec
ds_write_b32 * 3 misaligned by 1: 31.5 sec
ds_write_b96 misaligned by 2: 12.4 sec
ds_write_b32 + ds_write_b64 misaligned by 2: 22.0 sec
ds_write_b32 * 3 misaligned by 2: 31.5 sec
ds_write_b96 misaligned by 4: 12.4 sec
ds_write_b32 + ds_write_b64 misaligned by 4: 4.0 sec
ds_write_b32 * 3 misaligned by 4: 4.5 sec
ds_write_b96 misaligned by 8: 12.4 sec
ds_write_b32 + ds_write_b64 misaligned by 8: 13.0 sec
ds_write_b32 * 3 misaligned by 8: 4.5 sec
ds_read_b96 aligned: 3.8 sec
ds_read_b32 + ds_read_b64 aligned: 12.8 sec
ds_read_b32 * 3 aligned: 4.4 sec
ds_read_b96 misaligned by 1: 10.9 sec
ds_read_b32 + ds_read_b64 misaligned by 1: 21.8 sec
ds_read_b32 * 3 misaligned by 1: 31.5 sec
ds_read_b96 misaligned by 2: 10.9 sec
ds_read_b32 + ds_read_b64 misaligned by 2: 21.9 sec
ds_read_b32 * 3 misaligned by 2: 31.5 sec
ds_read_b96 misaligned by 4: 10.9 sec
ds_read_b32 + ds_read_b64 misaligned by 4: 3.8 sec
ds_read_b32 * 3 misaligned by 4: 4.5 sec
ds_read_b96 misaligned by 8: 10.9 sec
ds_read_b32 + ds_read_b64 misaligned by 8: 12.8 sec
ds_read_b32 * 3 misaligned by 8: 4.5 sec
```

Fixes: SWDEV-330802

Differential Revision: https://reviews.llvm.org/D123524

show more ...


# e66f0edb 07-Apr-2022 Stanislav Mekhanoshin <[email protected]>

[AMDGPU] Split unaligned LDS access instead of scalarizing

There is no need to fully scalarize an unaligned operation in
some case, just split it to alignment.

Differential Revision: https://review

[AMDGPU] Split unaligned LDS access instead of scalarizing

There is no need to fully scalarize an unaligned operation in
some case, just split it to alignment.

Differential Revision: https://reviews.llvm.org/D123330

show more ...


Revision tags: llvmorg-14.0.0, llvmorg-14.0.0-rc4, llvmorg-14.0.0-rc3, llvmorg-14.0.0-rc2, llvmorg-14.0.0-rc1, llvmorg-15-init, llvmorg-13.0.1, llvmorg-13.0.1-rc3
# d8b69040 20-Jan-2022 Abinav Puthan Purayil <[email protected]>

[AMDGPU] Set MemoryVT for truncstores in tblgen.

GlobalISelEmitter was skipping these patterns when its predicates were
checked. This patch should allow us to select d16_hi stores in
GlobalISel.

Di

[AMDGPU] Set MemoryVT for truncstores in tblgen.

GlobalISelEmitter was skipping these patterns when its predicates were
checked. This patch should allow us to select d16_hi stores in
GlobalISel.

Differential Revision: https://reviews.llvm.org/D117762

show more ...


Revision tags: llvmorg-13.0.1-rc2, llvmorg-13.0.1-rc1, llvmorg-13.0.0, llvmorg-13.0.0-rc4, llvmorg-13.0.0-rc3, llvmorg-13.0.0-rc2, llvmorg-13.0.0-rc1, llvmorg-14-init
# c2229724 26-Jul-2021 Matt Arsenault <[email protected]>

AMDGPU/GlobalISel: Stop using NarrowScalar/FewerElements for unaligned splitting

These actions should only be used for adjusting the register types
(and the memory type as needed to satisfy the regi

AMDGPU/GlobalISel: Stop using NarrowScalar/FewerElements for unaligned splitting

These actions should only be used for adjusting the register types
(and the memory type as needed to satisfy the register
type). Unaligned accesses should be split as a type of lowering.

This has the effect of improving the code in many cases since now we
produce zextloads instead of separate loads with ands. The load/store
legality rules still seem far more complicated than necessary though.

show more ...


# da067ed5 10-Nov-2021 Austin Kerbow <[email protected]>

[AMDGPU] Set most sched model resource's BufferSize to one

Using a BufferSize of one for memory ProcResources will result in better
ILP since it more accurately models the dependencies between memor

[AMDGPU] Set most sched model resource's BufferSize to one

Using a BufferSize of one for memory ProcResources will result in better
ILP since it more accurately models the dependencies between memory ops
and their consumers on an in-order processor. After this change, the
scheduler will treat the data edges from loads as blocking so that
stalls are guaranteed when waiting for data to be retreaved from memory.
Since we don't actually track waitcnt here, this should do a better job
at modeling their behavior.

Practically, this means that the scheduler will trigger the 'STALL'
heuristic more often.

This type of change needs to be evaluated experimentally. Preliminary
results are positive.

Fixes: SWDEV-282962

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D114777

show more ...


# d2e66d7f 06-Sep-2021 Konstantin Schwarz <[email protected]>

[GlobalISel] Add a combine for and(load , mask) -> zextload

This only handles simple masks, not shifted masks, for now.

Reviewed By: aemerson

Differential Revision: https://reviews.llvm.org/D109357


# 3ce1b963 08-Sep-2021 Joe Nash <[email protected]>

[AMDGPU] Switch PostRA sched to MachineSched

Use GCNHazardRecognizer in postra sched.
Updated tests for the new schedules.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D1095

[AMDGPU] Switch PostRA sched to MachineSched

Use GCNHazardRecognizer in postra sched.
Updated tests for the new schedules.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D109536

Change-Id: Ia86ba2ae168f12fb34b4d8efdab491f84d936cde

show more ...


Revision tags: llvmorg-12.0.1, llvmorg-12.0.1-rc4, llvmorg-12.0.1-rc3, llvmorg-12.0.1-rc2
# 31a9659d 07-Jun-2021 Matt Arsenault <[email protected]>

GlobalISel: Avoid use of G_INSERT in insertParts

G_INSERT legalization is incomplete and doesn't work very
well. Instead try to use sequences of G_MERGE_VALUES/G_UNMERGE_VALUES
padding with undef va

GlobalISel: Avoid use of G_INSERT in insertParts

G_INSERT legalization is incomplete and doesn't work very
well. Instead try to use sequences of G_MERGE_VALUES/G_UNMERGE_VALUES
padding with undef values (although this can get pretty large).

For the case of load/store narrowing, this is still performing the
load/stores in irregularly sized pieces. It might be cleaner to split
this down into equal sized pieces, and rely on load/store merging to
optimize it.

show more ...


Revision tags: llvmorg-12.0.1-rc1
# ac64995c 08-Apr-2021 hsmahesha <[email protected]>

[AMDGPU] Only use ds_read/write_b128 for alignment >= 16

PS: Submitting on behalf of Jay.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D100008


# d5fee599 08-Apr-2021 hsmahesha <[email protected]>

[AMDGPU] Add some exhaustive ds read/write alignment tests

PS: Submitting on behalf of Jay.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D100007